How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am encountering a curious issue: when I start a pod in my cluster and run Ray, I get a crash every time roughly 30 workers are spawned. It consistently happens at around 30 workers (e.g. 29 or 32), no matter how many CPUs I assign to the Ray cluster.
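For reference, this is roughly how I watch the worker count inside the pod while the reproduction scripts below run. It is only a sketch: it assumes psutil is installed in the pod and relies on Ray renaming its worker processes to titles starting with "ray::" (e.g. "ray::IDLE").

import psutil

# Count processes whose title starts with "ray::" (Ray renames its Python
# workers to e.g. "ray::IDLE" or "ray::Counter").
workers = []
for p in psutil.process_iter(["name", "cmdline"]):
    name = p.info["name"] or ""
    cmdline = p.info["cmdline"] or []
    if name.startswith("ray::") or any(arg.startswith("ray::") for arg in cmdline):
        workers.append(p)
print(f"{len(workers)} Ray worker processes currently running")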
I have two ways to reproduce this:
- Reproducing [Core] [Bug] When the created .remote() list is greater than 110, an error will occur when executing ray.get() · Issue #21415 · ray-project/ray · GitHub (that GitHub issue reports the bug on Windows, but I am hitting it on Unix):
import ray

ray.init(num_cpus=2)

@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

counters1 = [Counter.remote() for i in range(200)]
[c.increment.remote() for c in counters1]
futures1 = [c.read.remote() for c in counters1]
get_data = ray.get(futures1)
print(get_data)
print(len(get_data))
This will cause a crash:
2022-07-18 11:36:37,567 WARNING worker.py:1404 -- WARNING: 32 PYTHON worker processes have been started on node: f7a55588de2a972edc060f4237e53c250139b96ab15edbfeb667f77e with address: 10.245.93.27. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
Exception in thread ray_print_logs:
Traceback (most recent call last):
File "/home/cdsw/.conda/envs/dialogflow-ray/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/home/cdsw/.conda/envs/dialogflow-ray/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/home/cdsw/.local/lib/python3.9/site-packages/ray/worker.py", line 475, in print_logs
data = subscriber.poll()
File "/home/cdsw/.local/lib/python3.9/site-packages/ray/_private/gcs_pubsub.py", line 351, in poll
self._poll_locked(timeout=timeout)
File "/home/cdsw/.local/lib/python3.9/site-packages/ray/_private/gcs_pubsub.py", line 240, in _poll_locked
fut = self._stub.GcsSubscriberPoll.future(
File "/home/cdsw/.local/lib/python3.9/site-packages/grpc/_channel.py", line 972, in future
call = self._managed_call(
File "/home/cdsw/.local/lib/python3.9/site-packages/grpc/_channel.py", line 1306, in create
_run_channel_spin_thread(state)
File "/home/cdsw/.local/lib/python3.9/site-packages/grpc/_channel.py", line 1270, in _run_channel_spin_thread
channel_spin_thread.start()
File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 117, in grpc._cython.cygrpc.ForkManagedThread.start
File "/home/cdsw/.conda/envs/dialogflow-ray/lib/python3.9/threading.py", line 892, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Stack (most recent call first):
File "/home/cdsw/.local/lib/python3.9/site-packages/ray/worker.py", line 1576 in connect
File "/home/cdsw/.local/lib/python3.9/site-packages/ray/workers/default_worker.py", line 211 in <module>
[2022-07-18 11:37:37,615 E 33913 33913] (raylet) worker_pool.cc:502: Some workers of the worker process(35070) have not registered within the timeout. The process is dead, probably it crashed during start.
[2022-07-18 11:37:37,659 C 33913 33913] (raylet) worker_pool.cc:588: Failed to start worker with return value system:11: Resource temporarily unavailable
*** StackTrace Information ***
ray::SpdLogMessage::Flush()
ray::RayLog::~RayLog()
ray::raylet::WorkerPool::StartProcess()
ray::raylet::WorkerPool::StartWorkerProcess()
ray::raylet::WorkerPool::PopWorker()::{lambda()#1}::operator()()
ray::raylet::WorkerPool::PopWorker()
ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()
ray::raylet::LocalTaskManager::ScheduleAndDispatchTasks()
ray::raylet::WorkerPool::MonitorStartingWorkerProcess()::{lambda()#1}::operator()()
boost::asio::detail::wait_handler<>::do_complete()
boost::asio::detail::scheduler::do_run_one()
boost::asio::detail::scheduler::run()
boost::asio::io_context::run()
main
__libc_start_main
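The Python-side error is "can't start new thread" and the raylet reports "Resource temporarily unavailable", so in case it is relevant, this is roughly how I am checking the process/thread limits from inside the pod (just a sketch, assuming a standard Linux/cgroup layout):

import resource

# Per-user cap on processes; on Linux, new threads also count against RLIMIT_NPROC.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC soft/hard:", soft, hard)

# Kubernetes can also cap PIDs per pod via the pids cgroup
# (first path is the cgroup v1 layout, second is cgroup v2).
for path in ("/sys/fs/cgroup/pids/pids.max", "/sys/fs/cgroup/pids.max"):
    try:
        with open(path) as f:
            print(path, "=", f.read().strip())
    except FileNotFoundError:
        pass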
The second way I get this is by deploying 30 Ray Serve deployments:
from ray import serve

serve.start()

@serve.deployment()
class ChannelDeployer:
    def __init__(self, hdfs_path):
        self.model = hdfs_path

    async def __call__(self, request):
        return self.model

for i in range(30):
    ChannelDeployer.options(route_prefix=f"/infer/model_{i}", name=f"model_{i}", ray_actor_options={"num_cpus": 0.01}).deploy(str(i))
Any idea why this is happening, or at least where I should start looking?
In particular, for the first case, why is Ray creating that many workers when I specify only 2 CPUs? Is there any way to set a limit on the number of workers?
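The closest thing to a limit I have come up with myself is giving each actor an explicit CPU reservation, which (as far as I understand Ray's actor resource semantics) keeps the number of live worker processes at num_cpus, but then the remaining actors just stay pending and the script never finishes, so it is not a real solution. A sketch of what I mean:

import ray

ray.init(num_cpus=2)

@ray.remote(num_cpus=1)  # the actor now holds a whole CPU for its lifetime
class Counter:
    def __init__(self):
        self.n = 0

    def read(self):
        return self.n

# Only 2 such actors fit on a 2-CPU cluster; the other 198 creation requests
# stay pending, so ray.get() over all of their futures would block indefinitely.
counters = [Counter.remote() for _ in range(200)]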