Best practices to run multiple jobs on Ray

Hi guys,
I am new to Ray.
I’m running training jobs in Ray using the latest version. Each job is a Python script launched via the Ray Job Submission SDK. I have set up my training as a remote function with a specific amount of memory allocated to it.
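Roughly, the setup looks like this (simplified; the memory value is just illustrative):

```python
import ray

# Simplified sketch of the setup described above (illustrative values).
# The memory argument is in bytes and is a scheduling reservation,
# not a hard runtime limit on the task.
@ray.remote(memory=2 * 1024**3)  # reserve ~2 GiB for each training task
def train(config):
    ...  # actual training logic lives here
```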
When I submit multiple jobs at a time (e.g., simulating high load), the Ray autoscaler scales up the worker nodes correctly. However, the Ray head node eventually goes OOM. On inspection, there are no training jobs running on the head node — only the JobSupervisor and allocation processes, which is expected in my case.
This seems to be a fundamental limit: no matter how much I scale horizontally with workers, the head node becomes a bottleneck under high submission rates and crashes or causes jobs to hang/fail.

What is the best practice to submit or run jobs in Ray such that I can achieve the maximum number of jobs running in parallel without causing the head node to go out of memory under load?

Additionally, I noticed that jobs in PENDING status fail when the head node runs out of memory, which is expected. But jobs that were already in a RUNNING state often continue running indefinitely, even when the head node is clearly unstable. I’m unable to enforce timeouts for these jobs in my use case. Also, in many cases, the Ray head continues accepting new jobs even when memory usage is critically high, rather than rejecting or deferring them.

Why does the Ray head continue accepting jobs under critical memory pressure, and why do already running jobs not get cleaned up or cancelled even when the head is effectively non-functional?

Hi Vinu and welcome to the Ray community! We do have a few resources in our docs that might help with the OOM issues here: Debugging Memory Issues — Ray 2.47.1

We also have a guide on how to deploy large clusters which might be useful, but I’ll summarize a few of the findings here: Best practices for deploying large clusters — Ray 2.47.1

  • Give more resources to your head node so it can better handle the traffic from high submission rates.
  • Make sure num_cpus=0 is set on your head node so that no tasks get scheduled on it.
  • Consider limiting the rate at which you submit jobs so you don’t overwhelm the head node, e.g. with client-side rate limiting (see the sketch after this list).
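For the last point, here’s a minimal sketch of what client-side rate limiting could look like with the Job Submission SDK (the cap and the dashboard address are just placeholders for your setup):

```python
import time
from ray.job_submission import JobSubmissionClient, JobStatus

MAX_ACTIVE_JOBS = 8    # placeholder: tune to what your head node can handle
POLL_INTERVAL_S = 10

client = JobSubmissionClient("http://<head-node-ip>:8265")

def active_job_count() -> int:
    # Count jobs the head node is still tracking as PENDING or RUNNING.
    active = {JobStatus.PENDING, JobStatus.RUNNING}
    return sum(1 for job in client.list_jobs() if job.status in active)

def submit_with_backpressure(entrypoint: str, runtime_env: dict) -> str:
    # Hold new submissions until the cluster drains below the cap,
    # instead of handing everything to the head node at once.
    while active_job_count() >= MAX_ACTIVE_JOBS:
        time.sleep(POLL_INTERVAL_S)
    return client.submit_job(entrypoint=entrypoint, runtime_env=runtime_env)
```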

Have you also taken a look at the Ray dashboard when your head node starts failing to see if there’s any particular thing that is causing it to fail, or is it just the volume of jobs?

Why does the Ray head continue accepting jobs under critical memory pressure, and why do already running jobs not get cleaned up or cancelled even when the head is effectively non-functional?

The Ray memory monitor is designed to kill user tasks or actors when memory usage exceeds a threshold, but it does not prevent the head node from accepting new job submissions. The monitor’s main goal is to preemptively kill memory-hungry workers to avoid OOM, but it does not reject new jobs at the API level when the head node is under memory pressure. This is discussed a bit here: Out-Of-Memory Prevention — Ray 2.47.1

Basically, the system processes on the head keep accepting jobs even when the node is close to OOM, because the memory monitor only acts reactively by killing worker processes rather than rejecting new submissions. Those system processes can themselves become unstable if the head node actually runs out of memory.
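If it’s useful, the monitor’s threshold is configurable via environment variables on the node where the raylet starts. Here’s a rough local-testing sketch (values are illustrative; in a real cluster you’d set these in the environment of `ray start` on each node):

```python
import os

# The memory monitor is configured through the raylet's environment, so in a
# cluster you would set these where `ray start` runs (e.g. in the cluster
# YAML or container env). Values below are illustrative.
os.environ["RAY_memory_usage_threshold"] = "0.90"    # start killing workers at 90% node memory
os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # how often the monitor checks; 0 disables it

import ray

ray.init()  # a locally started raylet inherits the variables set above
```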

They discuss some workarounds here: Automatic Memory Scheduling for ML Workloads in Ray

I’d start off by provisioning more memory/resources for your head node first to see if that helps, though. Let me know if the docs are helpful too.

Hi Christina, thanks for the reply.

How well does Kueue work with Ray?

Can Kueue help me run multiple jobs and orchestrate them so that the jobs are scheduled on the Ray head only when memory is available in my use case? I’m uncertain whether gang scheduling is applicable in my scenario, as training jobs are triggered individually based on user requests.

Hi Vinu,
I’m personally not an expert with Kueue, but we do have an article here that describes a Kueue x KubeRay x Ray integration :slight_smile: https://cloud.google.com/blog/products/containers-kubernetes/using-kuberay-and-kueue-to-orchestrate-ray-applications-in-gke
Hopefully this is helpful, lmk!

Hi Christina, thanks for the reply.

One issue I encountered was that the head node accepts jobs even when its memory is already unstable. I came across Python SDK Overview — Ray 2.47.1, and in the Specifying CPU and GPU resources section, I noticed a parameter called entrypoint_memory.

It states:

“If any of these arguments are specified, the entrypoint script will be scheduled on a node with at least the specified resources, instead of the head node, which is the default.”

I tried using this and started testing with multiple training job submissions. However, I couldn’t find any jobs or entrypoint scripts running on the worker nodes, even when the head node didn’t have sufficient memory.

Is there any reason why this might be happening?

Here is the configuration I used during job submission:
client.submit_job(entrypoint=entrypoint, runtime_env=runtime_env, entrypoint_memory=100 * 1024 * 1024)
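As a sanity check, I can also swap the real training entrypoint for a tiny one that just reports which host it runs on (simplified sketch; the dashboard address is a placeholder, and the real jobs also pass a runtime_env):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node-ip>:8265")

# Sanity check: the entrypoint only prints the hostname it runs on, so I can
# see whether entrypoint_memory actually moves it off the head node.
job_id = client.submit_job(
    entrypoint='python -c "import socket; print(socket.gethostname())"',
    entrypoint_memory=100 * 1024 * 1024,  # 100 MiB (the value is in bytes)
)

# Logs may take a moment to appear after the job finishes.
print(client.get_job_logs(job_id))
```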