Your Ray head node is most likely crashing due to memory or resource exhaustion while handling a large number of pending jobs. This is a known limitation: the head node runs critical system components (GCS, dashboard, job supervisors) and, under high job-submission rates or with many pending jobs, it can become a bottleneck and run out of memory (OOM), causing cluster instability or restarts. Increasing GCS/Redis/health-check timeouts alone does not address the root cause, which is head-node overload from job management and system-process memory usage. Ray's memory monitor only kills user tasks and actors when memory is exceeded; it does not throttle or reject new job submissions, so the head node can still be overwhelmed and crash if too many jobs are queued or submitted rapidly. This is discussed in detail in the Ray community and best-practices guides (Best practices to run multiple jobs on Ray, Dashboard: job interface unstable with many jobs).
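If you want to confirm this on your cluster, it helps to check how much memory the Ray system processes on the head node are actually using. Below is a rough diagnostic sketch (plain `psutil`, not a Ray API); the process names to watch (`gcs_server`, `raylet`, the dashboard) can vary slightly between Ray versions, so treat that list as an assumption to adjust:

```python
import psutil

# Ray head-node system processes to watch; exact names may vary by Ray version.
WATCHED = ("gcs_server", "raylet", "dashboard")

def head_node_memory_report():
    """Print resident memory (RSS) of Ray system processes on this node."""
    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        try:
            name = proc.info["name"] or ""
            cmdline = " ".join(proc.info["cmdline"] or [])
            if any(w in name or w in cmdline for w in WATCHED):
                rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
                print(f"pid={proc.pid:<8d} rss={rss_mb:8.1f} MiB  {name}")
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

if __name__ == "__main__":
    head_node_memory_report()
```

Run this on the head node while jobs are queued; if `gcs_server` or the dashboard process keeps growing as pending jobs pile up, that matches the overload pattern described above.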
Recommended mitigations:

1. Provision more memory and CPU for the head node.
2. Set `num_cpus=0` on the head node so no user tasks or actors run there (see the first sketch below).
3. Rate-limit job submissions to avoid overwhelming the head node (see the second sketch below).
4. Monitor head node resource usage closely.

For large-scale or high-concurrency workloads, Ray clusters are not optimized for multi-tenant or long-running, high-job-count scenarios, and you may need to periodically restart the cluster or implement job retention/cleanup mechanisms (Ray FAQ). Would you like a step-by-step breakdown of how to apply these mitigations, or more detail on monitoring and debugging head node memory issues?
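For mitigation (2), the usual approach is to start the head node with `ray start --head --num-cpus=0` (or the equivalent head-node `resources` setting in your cluster config) so that no user tasks or actors are ever scheduled there. A minimal sketch to verify the setting took effect, assuming the cluster is reachable at the default address:

```python
import ray

# Attach to the running cluster (address assumed; adjust for your setup).
ray.init(address="auto")

# After starting the head with `ray start --head --num-cpus=0`, the head node
# should report 0 schedulable CPUs here, so tasks land only on worker nodes.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], "CPU:", node["Resources"].get("CPU", 0))
```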
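For mitigation (3), a simple client-side throttle that caps the number of unfinished jobs before submitting more goes a long way. A rough sketch using the Ray Jobs SDK; the dashboard address, the `my_script.py` entrypoint, and the `MAX_IN_FLIGHT` value are placeholders to tune for your head node:

```python
import time

from ray.job_submission import JobSubmissionClient, JobStatus

# Head node dashboard address (assumed; adjust for your deployment).
client = JobSubmissionClient("http://127.0.0.1:8265")

MAX_IN_FLIGHT = 20  # cap on jobs not yet finished; tune to your head node's capacity
TERMINAL = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}

def in_flight_jobs() -> int:
    """Count submitted jobs that have not reached a terminal state."""
    return sum(1 for job in client.list_jobs() if job.status not in TERMINAL)

def submit_throttled(entrypoint: str) -> str:
    """Block until the cluster has headroom, then submit the job."""
    while in_flight_jobs() >= MAX_IN_FLIGHT:
        time.sleep(5)  # back off instead of piling more pending jobs on the head node
    return client.submit_job(entrypoint=entrypoint)

# Example: submit a batch of jobs without flooding the head node.
for i in range(100):
    job_id = submit_throttled(f"python my_script.py --shard {i}")
    print("submitted", job_id)
```

Note that polling `list_jobs()` itself goes through the dashboard on the head node, so keep the polling interval reasonably large.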
Hint: Mention @RunLLM in the post for followups.