1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.51.1
- Python version: 3.12
- OS: Rocky Linux 8.8
- Cloud/Infrastructure: N/A
- Other libs/tools (if relevant): N/A
3. What happened vs. what you expected:
We are running a large, long-running Ray cluster for production workloads with extremely high task throughput. After approximately 7-10 days of continuous operation, we observe significant scheduling delays for new tasks.
Symptoms:
- Tasks stay in the PENDING_NODE_ASSIGNMENT state for 10-40 minutes before being scheduled
- Cluster resources show plenty of availability (only ~2% CPU utilized out of 5000+ cores)
- The gcs_server process consumes 85-90% CPU constantly
- Task metadata count grows to 100+ million entries within days

Expected behavior:
- Tasks should be scheduled within seconds when resources are available
- GCS should not become a bottleneck for scheduling
Workload characteristics
This is not a bug per se, but rather a question about architecture and best practices for our use case:
- Task volume: Billions of small tasks per week
- Job volume: Tens of thousands of jobs per week
- Task duration: Mostly short-lived (seconds to a few minutes)
- Task resources: Typically 1 CPU, 2-3 GB memory per task
- Cluster uptime goal: Continuous operation for weeks/months or even years
- Pattern: Heavy use of ray.remote functions with task chaining (ObjectRefs passed between tasks); see the sketch below
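For context, a minimal sketch of the task-chaining pattern (the function names and sizes are illustrative, not our production code):

```python
import ray

ray.init(address="auto")  # connect to the long-running cluster

# Illustrative stand-ins for our real processing stages; each task is
# small (1 CPU, short-lived) and results are chained via ObjectRefs.
@ray.remote(num_cpus=1)
def extract(record: str) -> str:
    return record.upper()

@ray.remote(num_cpus=1)
def transform(extracted: str) -> str:
    return extracted[::-1]

@ray.remote(num_cpus=1)
def load(transformed: str) -> int:
    return len(transformed)

# Each job fans out many short chains; passing ObjectRefs directly lets Ray
# schedule the next stage as soon as its input is ready. Thousands of such
# jobs per week add up to billions of task records.
records = [f"record-{i}" for i in range(10_000)]
refs = [load.remote(transform.remote(extract.remote(r))) for r in records]
totals = ray.get(refs)
print(sum(totals))
```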
Diagnostic information
```
# ray status shows low utilization but tasks won't schedule
Total Usage:
 113.3/5152.0 CPU
 0.0/14.0 GPU
 154.88GiB/5.97TiB memory

# GCS server is CPU-bound (single-threaded bottleneck)
$ top -b -n 1 | grep gcs_server
  PID USER  %CPU %MEM     TIME+ COMMAND
xxxxx root  87.5  5.0  11143:02 gcs_server

# Task count in cluster after ~7 days
Total tasks retrieved: 119,708,734 (truncated due to size)

# Total jobs accumulated
Total jobs: 6,368

# Raylet state shows no pending work in queue
Schedule queue length: 0
Number of spilled unschedulable leases: 0
```
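For reference, counts like the ones above can be pulled with Ray's state API; a rough sketch (the limit values are placeholders, and the API truncates very large result sets, hence "truncated due to size"):

```python
import ray
from ray.util.state import list_jobs, list_tasks, summarize_tasks

ray.init(address="auto")  # run from a node inside the cluster

# list_tasks()/list_jobs() page through GCS metadata; with ~100M task
# entries the output is truncated. The limit here is a placeholder.
tasks = list_tasks(limit=10_000, raise_on_missing_output=False)
jobs = list_jobs(limit=10_000, raise_on_missing_output=False)
print(f"Total tasks retrieved: {len(tasks):,}")
print(f"Total jobs: {len(jobs):,}")

# summarize_tasks() aggregates per task name/state without returning
# every entry, which is gentler on an already-loaded GCS.
print(summarize_tasks())
```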
Questions for the community
- Is Ray designed to handle this scale of task throughput in a long-running cluster?
  - Billions of tasks per week seems to exceed the GCS's capacity to manage metadata
  - Is there a documented upper limit for task throughput in a single cluster?
- What is the recommended approach for our use case?
  - Periodic cluster restarts (daily? weekly?)
  - Multiple smaller clusters with job routing?
  - An external task queue with Ray as the execution backend only?
- Which environment variables should we tune? (See the configuration sketch after this list for the kind of tuning we mean.)
  - RAY_task_events_max_num_task_in_gcs: what is the maximum safe value?
  - Are there other GCS garbage-collection parameters?
  - Can we completely disable task event recording if we don't need the dashboard history?
- Is there a way to manually trigger cleanup of finished task/job metadata without restarting the cluster?
- Future roadmap question: Are there plans to improve GCS scalability (e.g., sharding, distributed GCS, more aggressive garbage collection)?
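To make the tuning question concrete, this is the kind of configuration we are experimenting with in a local reproduction; the values are guesses, and we are not sure whether a zero report interval is a supported way to disable task event recording:

```python
# Sketch of the GCS task-event tuning we are experimenting with locally.
# On the real cluster these would be exported before `ray start --head`.
# Values are guesses, not recommendations.
import os

# Cap how many task entries the GCS retains before evicting old ones.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "100000"

# Our understanding is that a 0 report interval stops workers from sending
# task events to the GCS at all (dashboard task history is lost); we would
# appreciate confirmation that this is supported.
os.environ["RAY_task_events_report_interval_ms"] = "0"

import ray
ray.init()  # local test; env vars must be set before Ray processes start
```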
Additional context
We chose Ray over alternatives (Dask, Celery, etc.) because of its superior performance with high task counts. However, we’re hitting a wall with long-term cluster operation. We would appreciate any guidance on operating Ray at this scale.
Thank you!