Best practices for long-running Ray clusters with extremely high task throughput - GCS metadata accumulation causing scheduling delays

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.51.1
  • Python version: 3.12
  • OS: Rocky Linux 8.8
  • Cloud/Infrastructure: N/A
  • Other libs/tools (if relevant): N/A

3. What happened vs. what you expected:

We are running a large, long-running Ray cluster for production workloads with extremely high task throughput. After approximately 7-10 days of continuous operation, we observe significant scheduling delays for new tasks.

  • Symptoms:

    • Tasks stay in PENDING_NODE_ASSIGNMENT state for 10-40 minutes before being scheduled

    • Cluster resources show plenty of availability (only ~2% CPU utilized out of 5000+ cores)

    • gcs_server process consumes 85-90% CPU constantly

    • Task metadata count grows to 100+ million entries within days

  • Expected behavior:

    • Tasks should be scheduled within seconds when resources are available

    • GCS should not become a bottleneck for scheduling

Workload characteristics

This is not a bug per se, but rather a question about architecture and best practices for our use case:

  • Task volume: Billions of small tasks per week

  • Job volume: Tens of thousands of jobs per week

  • Task duration: Mostly short-lived (seconds to a few minutes)

  • Task resources: Typically 1 CPU, 2-3GB memory per task

  • Cluster uptime goal: Continuous operation for weeks/months or even years

  • Pattern: Heavy use of ray.remote functions with task chaining (ObjectRefs passed between tasks)
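
For concreteness, a minimal sketch of this pattern (function names, sizes, and resource numbers are illustrative):

```python
import ray

ray.init(address="auto")  # connect to the existing long-running cluster

@ray.remote(num_cpus=1, memory=2 * 1024**3)  # ~1 CPU, ~2 GB per task (illustrative)
def fetch(item_id):
    # short-lived work, typically seconds to a few minutes
    return {"id": item_id, "payload": b"..."}

@ray.remote(num_cpus=1, memory=2 * 1024**3)
def transform(record):
    record["processed"] = True
    return record

# Task chaining: the ObjectRef returned by fetch.remote() is passed directly
# to transform.remote(), so Ray schedules each transform as soon as its
# corresponding fetch finishes.
refs = [transform.remote(fetch.remote(i)) for i in range(1_000)]  # billions/week in production
results = ray.get(refs)
```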

Diagnostic information

# ray status shows low utilization but tasks won’t schedule
Total Usage:
  113.3/5152.0 CPU
  0.0/14.0 GPU
  154.88GiB/5.97TiB memory

# GCS server is CPU-bound (single-threaded bottleneck)
$ top -b -n 1 | grep gcs_server
PID   USER  %CPU  %MEM   TIME+      COMMAND
xxxxx root  87.5  5.0    11143:02   gcs_server

# Task count in cluster after ~7 days
Total tasks retrieved: 119,708,734 (truncated due to size)

# Total jobs accumulated
Total jobs: 6,368

# Raylet state shows no pending work in queue
Schedule queue length: 0
Number of spilled unschedulable leases: 0
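
For reference, counts like the task and job totals above can be gathered with the Ray state API (ray.util.state); this is a rough sketch, and the limits shown are placeholders:

```python
import ray
from ray.util.state import list_jobs, summarize_tasks

ray.init(address="auto")  # connect to the running cluster

# Aggregate task counts by state without listing every task individually.
print(summarize_tasks())

# Job metadata accumulated in the GCS. The state API truncates very large
# listings, so the reported total may be a lower bound.
jobs = list_jobs(limit=10_000, raise_on_missing_output=False)
print(f"Jobs retrieved: {len(jobs)}")
```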

Questions for the community

  1. Is Ray designed to handle this scale of task throughput in a long-running cluster?
  • Billions of tasks per week seems to exceed GCS’s capacity to manage metadata

  • Is there a documented upper limit for task throughput in a single cluster?

  2. What is the recommended approach for our use case?
  • Periodic cluster restart (daily? weekly?)

  • Multiple smaller clusters with job routing? (see the sketch after this list)

  • External task queue with Ray as execution backend only?

  3. Which environment variables should we tune?
  • RAY_task_events_max_num_task_in_gcs - what’s the max safe value?

  • Are there other GCS garbage collection parameters?

  • Can we completely disable task event recording if we don’t need the dashboard history?

  4. Is there a way to manually trigger cleanup of finished task/job metadata without restarting the cluster?

  5. Future roadmap question: Are there plans to improve GCS scalability (e.g., sharding, distributed GCS, more aggressive garbage collection)?
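
To make question 2 concrete, this is a rough sketch of the kind of job routing we have in mind for the multi-cluster option, using the Ray Job Submission SDK (cluster addresses and the round-robin policy are placeholders):

```python
from itertools import cycle

from ray.job_submission import JobSubmissionClient

# Dashboard addresses of several smaller Ray clusters (placeholders).
CLUSTERS = [
    "http://ray-head-a:8265",
    "http://ray-head-b:8265",
    "http://ray-head-c:8265",
]

_clients = cycle([JobSubmissionClient(addr) for addr in CLUSTERS])

def submit(entrypoint: str) -> str:
    """Round-robin a job onto the next cluster; returns the submission ID."""
    return next(_clients).submit_job(entrypoint=entrypoint)

# Example: spread many short jobs across clusters instead of one giant cluster.
print(submit("python run_batch.py --shard 0"))
```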

Additional context

We chose Ray over alternatives (Dask, Celery, etc.) because of its superior performance with high task counts. However, we’re hitting a wall with long-term cluster operation. We would appreciate any guidance on operating Ray at this scale.

Thank you!

Ray’s GCS is not currently designed for indefinite, ultra-high-throughput, long-running clusters with 100M+ task metadata entries; GCS becomes a bottleneck due to metadata accumulation, leading to high CPU usage and severe scheduling delays, as you observed. There is no documented hard upper limit, but practical experience and community reports show that clusters with billions of tasks per week will eventually hit GCS scalability and garbage collection limits, especially since GCS is single-threaded and task/job metadata is not aggressively GC’d by default.

Recommended approaches include: (1) periodic cluster restarts to clear accumulated metadata, (2) running multiple smaller clusters with job routing, and (3) tuning environment variables, in particular RAY_task_events_max_num_task_in_gcs (default 100,000; raising it increases GCS memory and CPU pressure, see ray_config_def.h) and task_events_report_interval_ms=0 to disable task event reporting if you do not need dashboard task history. There is currently no supported way to manually trigger metadata cleanup without a restart, and GCS sharding/distributed GCS is not yet available, though there is interest in future improvements. For now, regular restarts and/or cluster partitioning are the only robust solutions at your scale.
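
A minimal sketch of how those variables might be applied, assuming the head node (and therefore the GCS) is launched from the same Python process; if the head is started with `ray start --head`, export the same variables in that shell on every node instead. The values shown are illustrative, not tuned recommendations:

```python
import os

# Must be set before the Ray processes are launched; on a multi-node cluster,
# set them in the environment of every `ray start` invocation as well.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "10000"  # default 100,000
os.environ["RAY_task_events_report_interval_ms"] = "0"       # 0 disables task event reporting

import ray

ray.init()  # the locally started head/GCS inherits the variables above
```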

Would you like a detailed breakdown of which config parameters to set and how to automate cluster restarts?
