Pending Ray jobs crashing the Ray cluster

Hey Team,

I have been running into an issue where, whenever I have many pending Ray jobs queued (e.g. 80+), my Ray head crashes and the whole cluster restarts. During this time the cluster is very busy with processing. I have modified just about every env var to increase the GCS / Redis / health check timeouts, but I am not sure how to even approach the problem and would appreciate some help. I have 11 nodes: 1 Ray head node with 30 CPUs (num_cpus=0) and 45 GB RAM, and 10 worker nodes with 45 CPUs and 56 GB RAM each. None of the workloads use the Ray object store.

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.52.0
  • Python version: 3.11
  • OS: Debian
  • Cloud/Infrastructure: on-prem Kubernetes
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: many Ray jobs are submitted; the head processes as many as it can and queues the rest as pending.
  • Actual: the head node crashes once the number of pending jobs exceeds roughly 80, and because the workers lose their connection to the head, the whole cluster goes down.

Your Ray head node is likely crashing due to memory or resource exhaustion when handling a high number of pending jobs. This is a known limitation: the head node runs critical system components (GCS, dashboard, job supervisors), and under high job submission rates or with many pending jobs it can become a bottleneck and run out of memory (OOM), causing cluster instability or restarts. Increasing GCS/Redis/health-check timeouts alone does not address the root cause, which is head-node overload from job management and system-process memory usage. The Ray memory monitor only kills user tasks/actors when memory is exceeded; it does not prevent new job submissions or proactively reject jobs under high load, so the head can still be overwhelmed and crash if too many jobs are queued or submitted rapidly. This is discussed in more detail in the Ray community and best-practices guides (Best practices to run multiple jobs on Ray; Dashboard: job interface unstable with many jobs).
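
To confirm that diagnosis, it helps to watch how much memory the head's system processes use as the pending-job count climbs. Below is a minimal monitoring sketch, not official Ray tooling: it assumes psutil is installed on the head pod, and the process-name hints ("gcs_server", "raylet", "dashboard") are approximations you may need to adjust for your deployment.

```python
# Rough head-node memory monitor (assumes psutil is available on the head pod).
# Sums the RSS of processes whose name/cmdline matches Ray system components.
import time
import psutil

SYSTEM_PROCESS_HINTS = ("gcs_server", "raylet", "dashboard")  # approximate matches

def head_system_memory_mb() -> dict:
    usage: dict[str, float] = {}
    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        try:
            blob = " ".join(proc.info["cmdline"] or [proc.info["name"] or ""])
            for hint in SYSTEM_PROCESS_HINTS:
                if hint in blob:
                    usage[hint] = usage.get(hint, 0.0) + proc.info["memory_info"].rss / 1e6
                    break
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return usage

if __name__ == "__main__":
    while True:
        # e.g. {'gcs_server': 1200.3, 'dashboard': 850.1, 'raylet': 400.7}
        print(head_system_memory_mb())
        time.sleep(30)
```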

Recommended mitigations: (1) provision more memory and CPU for the head node, (2) keep num_cpus=0 on the head node so no user tasks run there (which you already do), (3) rate-limit job submissions so the head is never overwhelmed (a rough sketch follows below), and (4) monitor head-node resource usage closely. For large-scale or high-concurrency workloads, Ray clusters are not optimized for multi-tenant or long-running, high-job-count scenarios, and you may need to periodically restart the cluster or implement job retention/cleanup mechanisms (Ray FAQ). Would you like a step-by-step breakdown of how to apply these mitigations, or more detail on monitoring and debugging head-node memory issues?
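
For mitigation (3), a simple approach is to cap the number of jobs that are PENDING or RUNNING at any time and only submit more as that count drops. The sketch below uses the Ray Jobs SDK (ray.job_submission.JobSubmissionClient); the dashboard address, the MAX_ACTIVE value, and the my_task.py entrypoints are placeholders you would replace with your own.

```python
# Client-side backpressure on job submission (sketch, not a drop-in solution).
import time
from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://<head-node-ip>:8265")  # your dashboard address
MAX_ACTIVE = 20  # cap on PENDING + RUNNING jobs; tune to what your head can handle

def active_job_count() -> int:
    # Count jobs the head is currently tracking as queued or running.
    return sum(
        1 for job in client.list_jobs()
        if job.status in (JobStatus.PENDING, JobStatus.RUNNING)
    )

def submit_with_backpressure(entrypoints):
    for entrypoint in entrypoints:
        while active_job_count() >= MAX_ACTIVE:
            time.sleep(10)  # wait instead of piling more pending jobs onto the head
        client.submit_job(entrypoint=entrypoint)

submit_with_backpressure([f"python my_task.py --shard {i}" for i in range(200)])
```

This keeps the backlog on the client side rather than in the head's job manager, so the head only ever sees a bounded number of pending jobs.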

Hint: Mention @RunLLM in the post for followups.