Best practices for long-running Ray clusters with extremely high task throughput - GCS metadata accumulation causing scheduling delays

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.51.1
  • Python version: 3.12
  • OS: Rocky Linux 8.8
  • Cloud/Infrastructure: N/A
  • Other libs/tools (if relevant): N/A

3. What happened vs. what you expected:

We are running a large, long-running Ray cluster for production workloads with extremely high task throughput. After approximately 7-10 days of continuous operation, we observe significant scheduling delays for new tasks.

  • Symptoms:

    • Tasks stay in PENDING_NODE_ASSIGNMENT state for 10-40 minutes before being scheduled

    • Cluster resources show plenty of availability (only ~2% CPU utilized out of 5000+ cores)

    • gcs_server process consumes 85-90% CPU constantly

    • Task metadata count grows to 100+ million entries within days

  • Expected behavior:

    • Tasks should be scheduled within seconds when resources are available

    • GCS should not become a bottleneck for scheduling

Workload characteristics

This is not a bug per se, but rather a question about architecture and best practices for our use case:

  • Task volume: Billions of small tasks per week

  • Job volume: Tens of thousands of jobs per week

  • Task duration: Mostly short-lived (seconds to a few minutes)

  • Task resources: Typically 1 CPU, 2-3GB memory per task

  • Cluster uptime goal: Continuous operation for weeks/months or even years

  • Pattern: Heavy use of ray.remote functions with task chaining (ObjectRefs passed between tasks)
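
For concreteness, a minimal sketch of this pattern (function names, sizes, and resource numbers are illustrative):

```python
import ray

ray.init(address="auto")  # connect to the existing long-running cluster

@ray.remote(num_cpus=1, memory=2 * 1024**3)  # ~1 CPU, ~2 GB per task (illustrative)
def fetch(item_id):
    # short-lived work, typically seconds to a few minutes
    return {"id": item_id, "payload": b"..."}

@ray.remote(num_cpus=1, memory=2 * 1024**3)
def transform(record):
    record["processed"] = True
    return record

# Task chaining: the ObjectRef returned by fetch.remote() is passed directly
# to transform.remote(), so Ray schedules each transform as soon as its
# corresponding fetch finishes.
refs = [transform.remote(fetch.remote(i)) for i in range(1_000)]  # billions/week in production
results = ray.get(refs)
```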

Diagnostic information

# ray status shows low utilization but tasks won’t schedule
Total Usage:
  113.3/5152.0 CPU
  0.0/14.0 GPU
  154.88GiB/5.97TiB memory

# GCS server is CPU-bound (single-threaded bottleneck)
$ top -b -n 1 | grep gcs_server
PID   USER  %CPU  %MEM   TIME+      COMMAND
xxxxx root  87.5  5.0    11143:02   gcs_server

# Task count in cluster after ~7 days
Total tasks retrieved: 119,708,734 (truncated due to size)

# Total jobs accumulated
Total jobs: 6,368

# Raylet state shows no pending work in queue
Schedule queue length: 0
Number of spilled unschedulable leases: 0
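
For reference, counts like the task and job totals above can be gathered with the Ray state API (ray.util.state); this is a rough sketch, and the limits shown are placeholders:

```python
import ray
from ray.util.state import list_jobs, summarize_tasks

ray.init(address="auto")  # connect to the running cluster

# Aggregate task counts by state without listing every task individually.
print(summarize_tasks())

# Job metadata accumulated in the GCS. The state API truncates very large
# listings, so the reported total may be a lower bound.
jobs = list_jobs(limit=10_000, raise_on_missing_output=False)
print(f"Jobs retrieved: {len(jobs)}")
```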

Questions for the community

  1. Is Ray designed to handle this scale of task throughput in a long-running cluster?
  • Billions of tasks per week seems to exceed GCS’s capacity to manage metadata

  • Is there a documented upper limit for task throughput in a single cluster?

  2. What is the recommended approach for our use case?
  • Periodic cluster restart (daily? weekly?)

  • Multiple smaller clusters with job routing? (see the sketch after this list)

  • External task queue with Ray as execution backend only?

  3. Which environment variables should we tune?
  • RAY_task_events_max_num_task_in_gcs - what’s the max safe value?

  • Are there other GCS garbage collection parameters?

  • Can we completely disable task event recording if we don’t need the dashboard history?

  4. Is there a way to manually trigger cleanup of finished task/job metadata without restarting the cluster?

  5. Future roadmap question: Are there plans to improve GCS scalability (e.g., sharding, distributed GCS, more aggressive garbage collection)?
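
To make question 2 concrete, this is a rough sketch of the kind of job routing we have in mind for the multi-cluster option, using the Ray Job Submission SDK (cluster addresses and the round-robin policy are placeholders):

```python
from itertools import cycle

from ray.job_submission import JobSubmissionClient

# Dashboard addresses of several smaller Ray clusters (placeholders).
CLUSTERS = [
    "http://ray-head-a:8265",
    "http://ray-head-b:8265",
    "http://ray-head-c:8265",
]

_clients = cycle([JobSubmissionClient(addr) for addr in CLUSTERS])

def submit(entrypoint: str) -> str:
    """Round-robin a job onto the next cluster; returns the submission ID."""
    return next(_clients).submit_job(entrypoint=entrypoint)

# Example: spread many short jobs across clusters instead of one giant cluster.
print(submit("python run_batch.py --shard 0"))
```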

Additional context

We chose Ray over alternatives (Dask, Celery, etc.) because of its superior performance with high task counts. However, we’re hitting a wall with long-term cluster operation. We would appreciate any guidance on operating Ray at this scale.

Thank you!

Ray’s GCS is not currently designed for indefinite, ultra-high-throughput, long-running clusters with 100M+ task metadata entries; GCS becomes a bottleneck due to metadata accumulation, leading to high CPU usage and severe scheduling delays, as you observed. There is no documented hard upper limit, but practical experience and community reports show that clusters with billions of tasks per week will eventually hit GCS scalability and garbage collection limits, especially since GCS is single-threaded and task/job metadata is not aggressively GC’d by default.

Recommended approaches include: (1) periodic cluster restarts to clear accumulated metadata, (2) running multiple smaller clusters with job routing, and (3) tuning environment variables, in particular RAY_task_events_max_num_task_in_gcs (default 100,000; raising it increases GCS memory and CPU pressure, see ray_config_def.h) and task_events_report_interval_ms=0 to disable task event reporting if you do not need dashboard task history. There is currently no supported way to manually trigger metadata cleanup without a restart, and GCS sharding/distributed GCS is not yet available, though there is interest in future improvements. For now, regular restarts and/or cluster partitioning are the only robust solutions at your scale.
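
A minimal sketch of how those variables might be applied, assuming the head node (and therefore the GCS) is launched from the same Python process; if the head is started with `ray start --head`, export the same variables in that shell on every node instead. The values shown are illustrative, not tuned recommendations:

```python
import os

# Must be set before the Ray processes are launched; on a multi-node cluster,
# set them in the environment of every `ray start` invocation as well.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "10000"  # default 100,000
os.environ["RAY_task_events_report_interval_ms"] = "0"       # 0 disables task event reporting

import ray

ray.init()  # the locally started head/GCS inherits the variables above
```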

Would you like a detailed breakdown of which config parameters to set and how to automate cluster restarts?
