Your Ray head node is most likely crashing due to memory or resource exhaustion while handling a large number of pending jobs. This is a known limitation: the head node runs critical system components (GCS, dashboard, job supervisors) and, under high job-submission rates or with many pending jobs, it can become a bottleneck and run out of memory (OOM), causing cluster instability or restarts. Increasing GCS/Redis/health-check timeouts alone does not address the root cause, which is head-node overload from job management and system-process memory usage. Ray's memory monitor only kills user tasks and actors when memory is exceeded; it does not throttle or reject new job submissions, so the head node can still be overwhelmed and crash if too many jobs are queued or submitted rapidly. This is discussed in detail in the Ray community and best-practices guides (Best practices to run multiple jobs on Ray, Dashboard: job interface unstable with many jobs).
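If you want to confirm this on your cluster, it helps to check how much memory the Ray system processes on the head node are actually using. Below is a rough diagnostic sketch (plain `psutil`, not a Ray API); the process names to watch (`gcs_server`, `raylet`, the dashboard) can vary slightly between Ray versions, so treat that list as an assumption to adjust:

```python
import psutil

# Ray head-node system processes to watch; exact names may vary by Ray version.
WATCHED = ("gcs_server", "raylet", "dashboard")

def head_node_memory_report():
    """Print resident memory (RSS) of Ray system processes on this node."""
    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        try:
            name = proc.info["name"] or ""
            cmdline = " ".join(proc.info["cmdline"] or [])
            if any(w in name or w in cmdline for w in WATCHED):
                rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
                print(f"pid={proc.pid:<8d} rss={rss_mb:8.1f} MiB  {name}")
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

if __name__ == "__main__":
    head_node_memory_report()
```

Run this on the head node while jobs are queued; if `gcs_server` or the dashboard process keeps growing as pending jobs pile up, that matches the overload pattern described above.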
Recommended mitigations:

1. Provision more memory and CPU for the head node.
2. Set `num_cpus=0` on the head node so no user tasks or actors run there (see the first sketch below).
3. Rate-limit job submissions to avoid overwhelming the head node (see the second sketch below).
4. Monitor head node resource usage closely.

For large-scale or high-concurrency workloads, Ray clusters are not optimized for multi-tenant or long-running, high-job-count scenarios, and you may need to periodically restart the cluster or implement job retention/cleanup mechanisms (Ray FAQ). Would you like a step-by-step breakdown of how to apply these mitigations, or more detail on monitoring and debugging head node memory issues?
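For mitigation (2), the usual approach is to start the head node with `ray start --head --num-cpus=0` (or the equivalent head-node `resources` setting in your cluster config) so that no user tasks or actors are ever scheduled there. A minimal sketch to verify the setting took effect, assuming the cluster is reachable at the default address:

```python
import ray

# Attach to the running cluster (address assumed; adjust for your setup).
ray.init(address="auto")

# After starting the head with `ray start --head --num-cpus=0`, the head node
# should report 0 schedulable CPUs here, so tasks land only on worker nodes.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], "CPU:", node["Resources"].get("CPU", 0))
```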
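For mitigation (3), a simple client-side throttle that caps the number of unfinished jobs before submitting more goes a long way. A rough sketch using the Ray Jobs SDK; the dashboard address, the `my_script.py` entrypoint, and the `MAX_IN_FLIGHT` value are placeholders to tune for your head node:

```python
import time

from ray.job_submission import JobSubmissionClient, JobStatus

# Head node dashboard address (assumed; adjust for your deployment).
client = JobSubmissionClient("http://127.0.0.1:8265")

MAX_IN_FLIGHT = 20  # cap on jobs not yet finished; tune to your head node's capacity
TERMINAL = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}

def in_flight_jobs() -> int:
    """Count submitted jobs that have not reached a terminal state."""
    return sum(1 for job in client.list_jobs() if job.status not in TERMINAL)

def submit_throttled(entrypoint: str) -> str:
    """Block until the cluster has headroom, then submit the job."""
    while in_flight_jobs() >= MAX_IN_FLIGHT:
        time.sleep(5)  # back off instead of piling more pending jobs on the head node
    return client.submit_job(entrypoint=entrypoint)

# Example: submit a batch of jobs without flooding the head node.
for i in range(100):
    job_id = submit_throttled(f"python my_script.py --shard {i}")
    print("submitted", job_id)
```

Note that polling `list_jobs()` itself goes through the dashboard on the head node, so keep the polling interval reasonably large.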
Hint: Mention @RunLLM in the post for followups.