I am also encountering the same issue on our K8s cluster. Could you point out which merged PR solves this?
I have run into a similar issue. I'm attempting to run Ray inside a capsule of the job scheduler we have here. Even if I limit the num_cpus and memory that Ray should have access to:
ray.init(num_cpus=2, _memory=1.2e+11) # Hard coded for a capsule
it still spawns a huge number of threads per process (~1300).
Even though these processes are unlikely to be scheduled and do not consume much memory, I worry about the overhead I might impose on the job scheduler, especially because I plan to run multiple trials and the total number of threads will just keep increasing.
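For anyone who wants to put a number on the per-process thread count, here is a minimal sketch that reads the Threads field from /proc/<pid>/status. It assumes Linux; the function names and the "ray" substring filter on the command line are illustrative choices, not part of Ray's API.

```python
from pathlib import Path


def parse_thread_count(status_text):
    """Return the integer after 'Threads:' in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def thread_counts(name_filter="ray", proc_root="/proc"):
    """Map pid -> (command line, thread count) for matching processes.

    name_filter is a plain substring match on the command line (an assumption;
    adjust it to however your Ray processes show up).
    """
    counts = {}
    root = Path(proc_root)
    if not root.is_dir():
        return counts  # no procfs here (e.g. not Linux)
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
            status = (entry / "status").read_text()
        except OSError:
            continue  # process exited or is not readable by this user
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if name_filter in cmdline:
            n = parse_thread_count(status)
            if n is not None:
                counts[int(entry.name)] = (cmdline, n)
    return counts


if __name__ == "__main__":
    counts = thread_counts()
    # Print the processes with the most threads first.
    for pid, (cmd, n) in sorted(counts.items(), key=lambda kv: -kv[1][1]):
        print(f"{pid:>8} {n:>6} {cmd[:60]}")
    print("total threads:", sum(n for _, n in counts.values()))
```

Running this before and after starting a trial should show whether the total really grows with each trial.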
Is there an environment variable I can set? Or a cheap trick to avoid it (even if I have to hard-code it)?
Ray creates too many threads. Since a user's default NPROC ulimit is usually 4096, it breaks other processes of the same user. I can't even run the top command in the shell when Ray is running 32 tasks * 100+ threads per task.
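A quick way to check how close a job is to that limit is to read RLIMIT_NPROC from Python's standard resource module (Unix only). This is a rough sketch: the helper names are made up for illustration, and the 32 tasks and 100 threads per task are the figures from this post, not measured values. On Linux, each thread counts against this per-user limit.

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC pair for the current process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def estimated_threads(tasks, threads_per_task):
    """Lower-bound estimate of the threads a job will create."""
    return tasks * threads_per_task


soft, hard = nproc_limits()
needed = estimated_threads(tasks=32, threads_per_task=100)  # figures from the post
print(f"NPROC soft={format_limit(soft)} hard={format_limit(hard)}")
print(f"estimated Ray threads: {needed}")
if soft != resource.RLIM_INFINITY and needed >= soft:
    print("estimate alone reaches the soft limit")
```

With a soft limit of 4096, 3200 threads from Ray alone leave fewer than 900 for everything else the same user runs, which fits the symptom of not being able to start new processes.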
Is there a workaround for this? I am encountering this issue using Ray RLlib. I was not able to ssh into the machine running my job and thought I was just using too many rollout workers, but even after reducing the number of rollout workers, Ray still keeps me from ssh'ing into that machine. This is strange because the Ray dashboard looks fine. Are there any solutions?