Too many threads in ray worker

I'm also encountering the same issue on our K8s cluster. Could you point out which merged PR solves this?

I'm also seeing a similar issue in my case. I'm attempting to run Ray within a capsule of the job scheduler we have here. Even if I limit the num_cpus and memory Ray should have access to:

import ray

ray.init(num_cpus=2, _memory=1.2e+11)  # hard-coded for a capsule

It still spawns a huge number of threads per process (~1300).
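For reference, this is roughly how I counted them; it's a quick sketch that assumes Linux, walks /proc, and relies on Ray renaming its worker processes to titles starting with "ray::":

import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().decode(errors="replace").replace("\x00", " ").strip()
        # Each entry under /proc/<pid>/task is one kernel task, i.e. one thread.
        n_threads = len(os.listdir(f"/proc/{pid}/task"))
    except OSError:
        continue  # the process exited while we were scanning
    if "ray::" in cmdline:  # Ray sets its workers' process titles to "ray::<task/actor>"
        print(pid, n_threads, cmdline)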

Even though these threads are mostly idle (they rarely get scheduled) and don't consume much memory, I worry about the overhead I might cause for the job scheduler, especially because I plan to run multiple trials and the total number of threads will just keep growing.

Is there an environment variable I can set? Or a cheap trick to avoid it (even if I have to hard-code it)?

Ray creates too many threads. Since the user's default NPROC ulimit is usually 4096, this breaks other processes belonging to the same user; I can't even run top in a shell when Ray is running 32 tasks × 100+ threads per task.
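If you want to check the limit from Python rather than the shell, something like this should work on Linux, where RLIMIT_NPROC counts threads as well as processes, which is why Ray's threads eat into it:

import resource

# On Linux, RLIMIT_NPROC caps the number of tasks (processes *and* threads)
# the current user may own; hitting it makes fork()/clone() fail for everything
# else running as that user, including a new shell or top.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"NPROC soft limit: {soft}, hard limit: {hard}")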

Has there been any progress on this? I'm hitting this issue with Ray RLlib. I wasn't able to SSH into the machine running my job and thought I was simply using too many rollout workers, but even after reducing the number of rollout workers, Ray still keeps me from SSH'ing into that machine. It's very strange because the Ray dashboard looks fine. Are there any solutions for this?

I have found a workaround: [Core] Too many threads in ray worker · Issue #36936 · ray-project/ray · GitHub

Set RAY_num_server_call_thread env var to a smaller value.
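For example (the value 1 is just an illustration, and the variable has to be in the environment before Ray starts; if you connect to an existing cluster, set it on the nodes where the workers are launched instead):

import os
import ray

# RAY_num_server_call_thread must be set before ray.init() so that the raylet
# and the worker processes it launches inherit it.
os.environ["RAY_num_server_call_thread"] = "1"

ray.init(num_cpus=2)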


Unfortunately, this might mean another round of refactoring for me. A single Ray worker (running a Ray actor) had 110 threads, and even with RAY_num_server_call_thread=1 it still has about 55. That is too many because we run 120 actors on a single server.

We noticed something was wrong when a 5-second task sent to a single actor occasionally took 10-15 seconds, whereas the same task consistently took 5 seconds on an equivalent Dask cluster. I was able to get stable 5-second tasks by cutting the Ray actors down to just one and launching the individual tasks inside that actor with Python's own multiprocessing. The task is an external library call that does heavy matrix operations.
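Roughly what I mean by that workaround, as a sketch (np.linalg.inv stands in for our external library call, the pool size is arbitrary, and it assumes Linux's default fork start method):

import multiprocessing as mp
import numpy as np
import ray

@ray.remote
class PoolActor:
    """A single Ray actor that fans the heavy calls out to plain OS processes."""

    def __init__(self, num_procs):
        # The pool workers are ordinary Python processes, so they do not carry
        # Ray's per-worker thread overhead.
        self._pool = mp.Pool(processes=num_procs)

    def invert_all(self, matrices):
        # np.linalg.inv is a placeholder for the external matrix-heavy call;
        # each inversion runs in its own OS process.
        return self._pool.map(np.linalg.inv, matrices)

ray.init(num_cpus=2)
actor = PoolActor.remote(num_procs=4)
matrices = [np.random.rand(500, 500) for _ in range(8)]
results = ray.get(actor.invert_all.remote(matrices))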

I suspect the huge number of threads causes context switches and thrashing of the caches, which has a detrimental effect on our algorithm.

I have tried binding cores, adjusting niceness, holding the GIL, and playing with NUMA (we have 4 sockets in a server).
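By binding cores I mean something along these lines inside each actor (Linux-only; the core IDs are just an example):

import os
import ray

@ray.remote
class PinnedActor:
    def __init__(self, cores):
        # Restrict this actor's worker process, and therefore all of its
        # threads, to the given CPU cores.
        os.sched_setaffinity(0, set(cores))

    def which_cores(self):
        return sorted(os.sched_getaffinity(0))

ray.init(num_cpus=2)
actor = PinnedActor.remote(cores=[0, 1])
print(ray.get(actor.which_cores.remote()))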

If anyone knows a way to reduce the number of threads even further, I would be interested in hearing about it.