Too many threads in ray worker

I'm also encountering the same issue on our K8s cluster. Could you point out which merged PR solves this?

I'm also seeing a similar issue in my case. I'm attempting to run Ray within a capsule of the job scheduler we have here. Even if I limit the num_cpus and memory Ray should have access to:

import ray

ray.init(num_cpus=2, _memory=1.2e+11)  # hard-coded for a capsule

It still spawns a huge number of threads per process (~1300).
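For reference, this is roughly how I counted them; it's a quick sketch that assumes Linux, walks /proc, and relies on Ray renaming its worker processes to titles starting with "ray::":

import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().decode(errors="replace").replace("\x00", " ").strip()
        # Each entry under /proc/<pid>/task is one kernel task, i.e. one thread.
        n_threads = len(os.listdir(f"/proc/{pid}/task"))
    except OSError:
        continue  # the process exited while we were scanning
    if "ray::" in cmdline:  # Ray sets its workers' process titles to "ray::<task/actor>"
        print(pid, n_threads, cmdline)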

Even though these threads are mostly idle (they rarely get scheduled) and don't consume much memory, I worry about the overhead I might cause for the job scheduler, especially because I plan to run multiple trials and the total number of threads will just keep growing.

Is there an environment variable I can set? Or a cheap trick to avoid it (even if I have to hard-code it)?

Ray creates too many threads. Since the user's default NPROC ulimit is usually 4096, this breaks other processes belonging to the same user; I can't even run top in a shell when Ray is running 32 tasks × 100+ threads per task.
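If you want to check the limit from Python rather than the shell, something like this should work on Linux, where RLIMIT_NPROC counts threads as well as processes, which is why Ray's threads eat into it:

import resource

# On Linux, RLIMIT_NPROC caps the number of tasks (processes *and* threads)
# the current user may own; hitting it makes fork()/clone() fail for everything
# else running as that user, including a new shell or top.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"NPROC soft limit: {soft}, hard limit: {hard}")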

Has there been any progress on this? I'm hitting this issue with Ray RLlib. I wasn't able to SSH into the machine running my job and thought I was simply using too many rollout workers, but even after reducing the number of rollout workers, Ray still keeps me from SSH'ing into that machine. It's very strange because the Ray dashboard looks fine. Are there any solutions for this?

I have found a workaround: [Core] Too many threads in ray worker · Issue #36936 · ray-project/ray · GitHub

Set RAY_num_server_call_thread env var to a smaller value.
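For example (the value 1 is just an illustration, and the variable has to be in the environment before Ray starts; if you connect to an existing cluster, set it on the nodes where the workers are launched instead):

import os
import ray

# RAY_num_server_call_thread must be set before ray.init() so that the raylet
# and the worker processes it launches inherit it.
os.environ["RAY_num_server_call_thread"] = "1"

ray.init(num_cpus=2)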


Unfortunately, this might mean another round of refactoring for me. A single Ray worker (running a Ray actor) had 110 threads, and even with RAY_num_server_call_thread=1 it still has about 55. That is too many because we run 120 actors on a single server.

We noticed something was wrong when a 5-second task sent to a single actor occasionally took 10-15 seconds, whereas the same task consistently took 5 seconds on an equivalent Dask cluster. I was able to get stable 5-second tasks by cutting the Ray actors down to just one and launching the individual tasks inside that actor with Python's own multiprocessing. The task is an external library call that does heavy matrix operations.
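Roughly what I mean by that workaround, as a sketch (np.linalg.inv stands in for our external library call, the pool size is arbitrary, and it assumes Linux's default fork start method):

import multiprocessing as mp
import numpy as np
import ray

@ray.remote
class PoolActor:
    """A single Ray actor that fans the heavy calls out to plain OS processes."""

    def __init__(self, num_procs):
        # The pool workers are ordinary Python processes, so they do not carry
        # Ray's per-worker thread overhead.
        self._pool = mp.Pool(processes=num_procs)

    def invert_all(self, matrices):
        # np.linalg.inv is a placeholder for the external matrix-heavy call;
        # each inversion runs in its own OS process.
        return self._pool.map(np.linalg.inv, matrices)

ray.init(num_cpus=2)
actor = PoolActor.remote(num_procs=4)
matrices = [np.random.rand(500, 500) for _ in range(8)]
results = ray.get(actor.invert_all.remote(matrices))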

I suspect the huge number of threads causes context switches and thrashing of the caches, which has a detrimental effect on our algorithm.

I have tried binding cores, adjusting niceness, holding the GIL, and playing with NUMA (we have 4 sockets in a server).
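By binding cores I mean something along these lines inside each actor (Linux-only; the core IDs are just an example):

import os
import ray

@ray.remote
class PinnedActor:
    def __init__(self, cores):
        # Restrict this actor's worker process, and therefore all of its
        # threads, to the given CPU cores.
        os.sched_setaffinity(0, set(cores))

    def which_cores(self):
        return sorted(os.sched_getaffinity(0))

ray.init(num_cpus=2)
actor = PinnedActor.remote(cores=[0, 1])
print(ray.get(actor.which_cores.remote()))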

If anyone knows a way to reduce the number of threads even further, I would be interested in hearing about it.