1. Severity of the issue: (select one)
High: Completely blocks me.
Nightmare: Blocks whole System
2. Environment:
Ray version: 2.49.2
Python version: 3.10
OS: Linux 5.14.0-570.33.2.el9_6.x86_64
Cloud/Infrastructure: Local network
3. What happened vs. what you expected:
I start 5-8 jobs, and each job runs 1-8 actors. Often most of these actors are paused and only 1-2 are running. The system has enough CPUs and RAM. I have one head node for the Ray cluster and 2-3 remote worker nodes with CPUs.
I first launched normal Python scripts, then I tried the Job Submission API to spread the entrypoints over the nodes; in all cases the system breaks down fairly soon.
Soon after I start the jobs, the system massively slows down. SSH becomes extremely slow: it takes 1-2 minutes to log in and I have to press CTRL+C just to get a bash shell (i.e. .bashrc is not even executed). Every command is slow; ps and kill take a while, and other commands like top, and even ray status or ray stop, produce no output and are basically non-functional.
The resource readings I can see look fine: plenty of free RAM, multiple idle CPUs, and network speeds of 0.5-10 GBit/s (network I/O not measured during the slowdown), so I wonder what could degrade the system so badly that not even SSH works.
I looked at the logs and saw nothing that seems to point in the right direction.
My Ray worker nodes stop at some point because they no longer receive messages from the head node due to the massive slowdown.
I looked through the Ray and job logs but nothing sticks out. The logs continue normally until the sudden slowdown and then pause until I start killing processes, which allows some of them to continue / become unstuck.
What could cause such a massive outage and system slowdown? Any ideas how I could prevent this, or debug it to find the cause?
A massive system slowdown with Ray, where SSH and basic commands become unresponsive despite available CPU and RAM, is often caused by process/thread exhaustion or by hitting OS resource limits (e.g., max processes, max open files), not by Ray's own resource tracking. This can happen if Ray spawns too many worker processes or threads, overwhelming the OS scheduler and making the system nearly unusable even though top shows idle CPUs and free RAM. Users have reported similar symptoms, and the fix was to limit the number of Ray workers by setting a lower --num-cpus value when starting Ray, or by adjusting ulimit settings for max user processes and open files. For example, reducing --num-cpus to a value well below the physical core count (e.g., 8 instead of 20) resolved the issue for one user and made the system stable again. See the detailed discussion and solution in Ray distributed training causes SSH disconnection due to process exhaustion.
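As a concrete illustration, here is a minimal Python sketch of that mitigation, assuming a budget of 8 CPUs per node (the value 8 and the Worker class are made up for the example; on a multi-node cluster the equivalent is passing --num-cpus to ray start on each node):

```python
import ray

# Start Ray with fewer logical CPUs than the machine has, so it spawns fewer
# worker processes. On a multi-node cluster, the same cap is set per node via
# `ray start --head --num-cpus=8` / `ray start --address=<head>:6379 --num-cpus=8`.
ray.init(num_cpus=8)

# Give every actor an explicit CPU reservation so Ray cannot schedule more
# concurrent actors per node than the advertised CPU count allows.
@ray.remote(num_cpus=1)
class Worker:
    def run(self, x):
        return x * 2

workers = [Worker.remote() for _ in range(8)]  # at most 8 actors, one CPU each
print(ray.get([w.run.remote(i) for i, w in enumerate(workers)]))
```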
To debug, monitor the number of processes (e.g., with ps or htop), check ulimit -u and ulimit -n, and try lowering Ray’s --num-cpus or actor resource requirements. If the system becomes unresponsive, it’s likely due to hitting these OS-level limits, not a Ray-internal bug. Also, avoid launching too many jobs or actors at once, and consider using tmux or screen to keep sessions alive during troubleshooting.
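For a quick check, here is a small Python sketch (assumes Linux with GNU ps available) that prints the limits Ray workers inherit from your shell and counts your processes and threads:

```python
import os
import resource
import subprocess

# Soft/hard limits inherited by any Ray worker started from this shell.
print("max user processes (ulimit -u):", resource.getrlimit(resource.RLIMIT_NPROC))
print("max open files     (ulimit -n):", resource.getrlimit(resource.RLIMIT_NOFILE))

# Count processes and threads owned by the current user
# (roughly `ps -u $USER -L --no-headers | wc -l`; -L lists one row per thread).
out = subprocess.run(
    ["ps", "-u", str(os.getuid()), "-L", "--no-headers"],
    capture_output=True, text=True, check=True,
)
print("processes + threads:", len(out.stdout.splitlines()))
```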
Would you like more detail on how to set these limits or further debug the issue?
The most common OS limits that can cause Ray to freeze your system are listed below (a small script for checking them follows the list):
Max user processes (ulimit -u): Limits the total number of processes/threads a user can create. If exceeded, new processes/threads (including SSH, bash, Ray workers) cannot start, causing system-wide slowdowns and failures like “Resource temporarily unavailable”.
Max open files (ulimit -n): Limits the number of file descriptors per process. Ray uses many network connections, so a low limit can cause “Too many open files” errors and system instability.
Cgroup PID limits (pids.max): On some systems (especially containers/Kubernetes), cgroups can restrict the total number of PIDs, which can be much lower than the system default.
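Here is a minimal Python sketch for inspecting those limits on Linux (the cgroup v2 path is derived from /proc/self/cgroup and only yields a value if the pids controller is enabled there; paths differ on cgroup v1 setups):

```python
from pathlib import Path

def read(path):
    """Return the file's text, or None if it doesn't exist or can't be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

# System-wide ceilings on PIDs/threads.
print("kernel.pid_max     :", read("/proc/sys/kernel/pid_max"))
print("kernel.threads-max :", read("/proc/sys/kernel/threads-max"))

# cgroup v2: pids.max of the cgroup this process runs in (e.g. a container limit).
cgroup_info = read("/proc/self/cgroup") or ""
for line in cgroup_info.splitlines():
    if line.startswith("0::"):
        rel = line.split("::", 1)[1]
        print("cgroup pids.max    :", read(f"/sys/fs/cgroup{rel}/pids.max"))
```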