1. Severity of the issue: (select one)
High: Completely blocks me.
Nightmare: Blocks whole System
2. Environment:
Ray version: 2.49.2
Python version: 3.10
OS: Linux 5.14.0-570.33.2.el9_6.x86_64
Cloud/Infrastructure: Local network
3. What happened vs. what you expected:
I start 5-8 jobs, and each job runs 1-8 actors. Often most of these actors are paused and only 1-2 are running. The system has enough CPUs and RAM. I have one head node for the Ray cluster and 2-3 remote worker nodes with CPUs.
I first launched normal Python scripts, then I tried the Job Submission API to spread the entrypoints over the nodes; in all cases the system breaks down fairly soon.
Soon after I start the jobs, the system massively slows down. SSH becomes extremely slow: it takes 1-2 minutes to log in and I have to press CTRL+C just to get a bash shell (i.e. .bashrc is not even executed). Every command is slow; ps and kill take a while, and other commands like top, and even ray status or ray stop, produce no output and are basically non-functional.
The resource readings I can see look fine: plenty of free RAM, multiple idle CPUs, and network speeds of 0.5-10 GBit/s (network I/O not measured during the slowdown), so I wonder what could degrade the system so badly that not even SSH works.
I looked at the logs and saw nothing that seems to point in the right direction.
My Ray worker nodes stop at some point because they no longer receive messages from the head node due to the massive slowdown.
I looked through the Ray and job logs but nothing sticks out. The logs continue normally until the sudden slowdown and then pause until I start killing processes, which allows some of them to continue / become unstuck.
What could cause such a massive outage and system slowdown? Any ideas how I could prevent this, or debug it to find the cause?
A massive system slowdown with Ray, where SSH and basic commands become unresponsive despite available CPU and RAM, is often caused by process/thread exhaustion or by hitting OS resource limits (e.g., max processes, max open files), not by Ray's own resource tracking. This can happen if Ray spawns too many worker processes or threads, overwhelming the OS scheduler and making the system nearly unusable even though top shows idle CPUs and free RAM. Users have reported similar symptoms, and the fix was to limit the number of Ray workers by setting a lower --num-cpus value when starting Ray, or by adjusting ulimit settings for max user processes and open files. For example, reducing --num-cpus to a value well below the physical core count (e.g., 8 instead of 20) resolved the issue for one user and made the system stable again. See the detailed discussion and solution in Ray distributed training causes SSH disconnection due to process exhaustion.
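As a concrete illustration, here is a minimal Python sketch of that mitigation, assuming a budget of 8 CPUs per node (the value 8 and the Worker class are made up for the example; on a multi-node cluster the equivalent is passing --num-cpus to ray start on each node):

```python
import ray

# Start Ray with fewer logical CPUs than the machine has, so it spawns fewer
# worker processes. On a multi-node cluster, the same cap is set per node via
# `ray start --head --num-cpus=8` / `ray start --address=<head>:6379 --num-cpus=8`.
ray.init(num_cpus=8)

# Give every actor an explicit CPU reservation so Ray cannot schedule more
# concurrent actors per node than the advertised CPU count allows.
@ray.remote(num_cpus=1)
class Worker:
    def run(self, x):
        return x * 2

workers = [Worker.remote() for _ in range(8)]  # at most 8 actors, one CPU each
print(ray.get([w.run.remote(i) for i, w in enumerate(workers)]))
```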
To debug, monitor the number of processes (e.g., with ps or htop), check ulimit -u and ulimit -n, and try lowering Ray’s --num-cpus or actor resource requirements. If the system becomes unresponsive, it’s likely due to hitting these OS-level limits, not a Ray-internal bug. Also, avoid launching too many jobs or actors at once, and consider using tmux or screen to keep sessions alive during troubleshooting.
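For a quick check, here is a small Python sketch (assumes Linux with GNU ps available) that prints the limits Ray workers inherit from your shell and counts your processes and threads:

```python
import os
import resource
import subprocess

# Soft/hard limits inherited by any Ray worker started from this shell.
print("max user processes (ulimit -u):", resource.getrlimit(resource.RLIMIT_NPROC))
print("max open files     (ulimit -n):", resource.getrlimit(resource.RLIMIT_NOFILE))

# Count processes and threads owned by the current user
# (roughly `ps -u $USER -L --no-headers | wc -l`; -L lists one row per thread).
out = subprocess.run(
    ["ps", "-u", str(os.getuid()), "-L", "--no-headers"],
    capture_output=True, text=True, check=True,
)
print("processes + threads:", len(out.stdout.splitlines()))
```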
Would you like more detail on how to set these limits or further debug the issue?
The most common OS limits that can cause Ray to freeze your system are listed below (a small script for checking them follows the list):
Max user processes (ulimit -u): Limits the total number of processes/threads a user can create. If exceeded, new processes/threads (including SSH, bash, Ray workers) cannot start, causing system-wide slowdowns and failures like “Resource temporarily unavailable”.
Max open files (ulimit -n): Limits the number of file descriptors per process. Ray uses many network connections, so a low limit can cause “Too many open files” errors and system instability.
Cgroup PID limits (pids.max): On some systems (especially containers/Kubernetes), cgroups can restrict the total number of PIDs, which can be much lower than the system default.
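Here is a minimal Python sketch for inspecting those limits on Linux (the cgroup v2 path is derived from /proc/self/cgroup and only yields a value if the pids controller is enabled there; paths differ on cgroup v1 setups):

```python
from pathlib import Path

def read(path):
    """Return the file's text, or None if it doesn't exist or can't be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

# System-wide ceilings on PIDs/threads.
print("kernel.pid_max     :", read("/proc/sys/kernel/pid_max"))
print("kernel.threads-max :", read("/proc/sys/kernel/threads-max"))

# cgroup v2: pids.max of the cgroup this process runs in (e.g. a container limit).
cgroup_info = read("/proc/self/cgroup") or ""
for line in cgroup_info.splitlines():
    if line.startswith("0::"):
        rel = line.split("::", 1)[1]
        print("cgroup pids.max    :", read(f"/sys/fs/cgroup{rel}/pids.max"))
```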