Crash when reaching 30 workers

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am encountering a curious issue: when I start a pod in my cluster and run Ray, I get a crash every time I spawn 30 workers. It consistently happens at 30 (or around 30, like 29/32) workers, no matter how many CPUs I assign to the Ray cluster.

I have 2 ways to reproduce this:

  1. Reproducing the GitHub issue "[Core] [Bug] When the created .remote() list is greater than 110, an error will occur when executing ray.get()" (ray-project/ray#21415). That issue reports the bug on Windows, but I am hitting it on Unix:
import ray

ray.init(num_cpus=2)

@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

# 200 actors, each backed by its own worker process
counters1 = [Counter.remote() for i in range(200)]
[c.increment.remote() for c in counters1]
futures1 = [c.read.remote() for c in counters1]

get_data = ray.get(futures1)
print(get_data)
print(len(get_data))

This will cause a crash:

2022-07-18 11:36:37,567	WARNING worker.py:1404 -- WARNING: 32 PYTHON worker processes have been started on node: f7a55588de2a972edc060f4237e53c250139b96ab15edbfeb667f77e with address: 10.245.93.27. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).

Exception in thread ray_print_logs:
Traceback (most recent call last):
  File "/home/cdsw/.conda/envs/dialogflow-ray/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/cdsw/.conda/envs/dialogflow-ray/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cdsw/.local/lib/python3.9/site-packages/ray/worker.py", line 475, in print_logs
    data = subscriber.poll()
  File "/home/cdsw/.local/lib/python3.9/site-packages/ray/_private/gcs_pubsub.py", line 351, in poll
    self._poll_locked(timeout=timeout)
  File "/home/cdsw/.local/lib/python3.9/site-packages/ray/_private/gcs_pubsub.py", line 240, in _poll_locked
    fut = self._stub.GcsSubscriberPoll.future(
  File "/home/cdsw/.local/lib/python3.9/site-packages/grpc/_channel.py", line 972, in future
    call = self._managed_call(
  File "/home/cdsw/.local/lib/python3.9/site-packages/grpc/_channel.py", line 1306, in create
    _run_channel_spin_thread(state)
  File "/home/cdsw/.local/lib/python3.9/site-packages/grpc/_channel.py", line 1270, in _run_channel_spin_thread
    channel_spin_thread.start()
  File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 117, in grpc._cython.cygrpc.ForkManagedThread.start
  File "/home/cdsw/.conda/envs/dialogflow-ray/lib/python3.9/threading.py", line 892, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Stack (most recent call first):
  File "/home/cdsw/.local/lib/python3.9/site-packages/ray/worker.py", line 1576 in connect
  File "/home/cdsw/.local/lib/python3.9/site-packages/ray/workers/default_worker.py", line 211 in <module>
[2022-07-18 11:37:37,615 E 33913 33913] (raylet) worker_pool.cc:502: Some workers of the worker process(35070) have not registered within the timeout. The process is dead, probably it crashed during start.
[2022-07-18 11:37:37,659 C 33913 33913] (raylet) worker_pool.cc:588: Failed to start worker with return value system:11: Resource temporarily unavailable
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::raylet::WorkerPool::StartProcess()
    ray::raylet::WorkerPool::StartWorkerProcess()
    ray::raylet::WorkerPool::PopWorker()::{lambda()#1}::operator()()
    ray::raylet::WorkerPool::PopWorker()
    ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()
    ray::raylet::LocalTaskManager::ScheduleAndDispatchTasks()
    ray::raylet::WorkerPool::MonitorStartingWorkerProcess()::{lambda()#1}::operator()()
    boost::asio::detail::wait_handler<>::do_complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

The second way I get this is by deploying 30 Ray Serve deployments:

from ray import serve

serve.start()  # Serve needs to be running before .deploy() is called

@serve.deployment()
class ChannelDeployer:
    def __init__(self, hdfs_path):
        self.model = hdfs_path

    async def __call__(self, request):
        return self.model

for i in range(30):
    ChannelDeployer.options(route_prefix=f"/infer/model_{i}", name=f"model_{i}", ray_actor_options={"num_cpus": 0.01}).deploy(str(i))

Any idea why this is happening, or at least where I should start looking?

In particular, for the first case: why is Ray creating that many workers if I specify only 2 CPUs? Is there any way to set a limit on the number of workers?

I suspect you are hitting some system-level limit, as the message "system:11: Resource temporarily unavailable" indicates.

According to man fork, the following are the possible reasons (a quick way to check these limits from inside the pod is sketched after the excerpt):
https://man7.org/linux/man-pages/man2/fork.2.html

ERRORS
       EAGAIN A system-imposed limit on the number of threads was
              encountered.  There are a number of limits that may
              trigger this error:

              *  the RLIMIT_NPROC soft resource limit (set via
                 setrlimit(2)), which limits the number of processes and
                 threads for a real user ID, was reached;

              *  the kernel's system-wide limit on the number of
                 processes and threads, /proc/sys/kernel/threads-max,
                 was reached (see proc(5));

              *  the maximum number of PIDs, /proc/sys/kernel/pid_max,
                 was reached (see proc(5)); or

              *  the PID limit (pids.max) imposed by the cgroup "process
                 number" (PIDs) controller was reached.

Thanks a lot @Chen_Shen for replying!

Indeed, it looks like creating 30 workers in Ray spawns a total of about 900 threads (is this expected?), and we consistently see crashes when we reach 900.

My other limits are very high, so I think the culprit is pids.max set by the cgroup, which is just 1024. I will see if we are able to change that on our cluster.

Is it normal for Ray to spawn 900+ threads for 30 workers, or are we doing something wrong?
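
For reference, here is roughly how one can count this from inside the pod (a sketch using psutil; matching on default_worker.py in the command line is an assumption about how the Ray worker processes appear):

import psutil  # third-party: pip install psutil

total_threads = 0
worker_count = 0
for p in psutil.process_iter(attrs=["cmdline", "num_threads"]):
    cmdline = " ".join(p.info["cmdline"] or [])
    # Ray worker processes are assumed to be launched via default_worker.py
    if "default_worker.py" in cmdline:
        worker_count += 1
        total_threads += p.info["num_threads"]

print(f"{worker_count} Ray worker processes, {total_threads} threads in total")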

pids.max set by the cgroup, which is just 1024. I will see if we are able to change that on our cluster.

Ah, I think this is very likely the culprit.

Indeed, I monitored pids.current while deploying those 30 Serve endpoints and confirmed that it crashes when it reaches 1024.
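
In case it helps anyone else, a minimal polling loop along these lines is enough to watch the counter climb during deployment (again assuming cgroup v1 paths; they differ under cgroup v2):

import time

PIDS_CURRENT = "/sys/fs/cgroup/pids/pids.current"
PIDS_MAX = "/sys/fs/cgroup/pids/pids.max"

while True:
    # Print the pod's current PID/thread usage against its cgroup limit.
    with open(PIDS_CURRENT) as cur, open(PIDS_MAX) as mx:
        print(f"pids.current={cur.read().strip()} / pids.max={mx.read().strip()}")
    time.sleep(1)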

This is the default on OpenShift, so for anyone looking to run Ray on OpenShift and ending up here looking for help, make sure to change the pids_limit default: Machine configuration tasks | Post-installation configuration | OpenShift Container Platform 4.9.

As an aside, is there a way to limit the number of threads per worker, or are 900+ threads across 30 Ray workers required?

Closing the loop: we raised the cgroup PID limit, and our program no longer crashes.


Just to chime in as well, since I recently encountered the same error: we didn't have a cgroup limit, but for me, simply running ulimit -u 65536 right before ray start (on each node) solved it.
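
For anyone who would rather check or raise this from Python than from the shell, the equivalent knob is RLIMIT_NPROC. A sketch (the soft limit can only be raised up to the hard limit, and the change only applies to the process that sets it and its children, so it has to happen in the same process/shell that starts Ray):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC soft={soft} hard={hard}")

target = 65536  # same value as `ulimit -u 65536` above
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)  # cannot exceed the hard limit without privileges

if soft == resource.RLIM_INFINITY or soft >= target:
    print("soft limit is already high enough")
else:
    resource.setrlimit(resource.RLIMIT_NPROC, (target, hard))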
