How severe does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
I often hit RuntimeError: can't start new thread in my local development environment. It usually happens when I launch too many tasks/actors at the same time. From the logs, it looks as if the gRPC channels have exceeded some limit. Here is a typical traceback:
Traceback (most recent call last):
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/site-packages/ray/_private/worker.py", line 868, in print_logs
    data = subscriber.poll()
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 362, in poll
    self._poll_locked(timeout=timeout)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 249, in _poll_locked
    fut = self._stub.GcsSubscriberPoll.future(
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/site-packages/grpc/_channel.py", line 972, in future
    call = self._managed_call(
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/site-packages/grpc/_channel.py", line 1306, in create
    _run_channel_spin_thread(state)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/site-packages/grpc/_channel.py", line 1270, in _run_channel_spin_thread
    channel_spin_thread.start()
  File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 117, in grpc._cython.cygrpc.ForkManagedThread.start
  File "/home2/hanwen.qiu/miniconda3/envs/ray_server/lib/python3.8/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
I have tried some environment variables, such as OMP_NUM_THREADS and OPENBLAS_NUM_THREADS, but the results were not satisfactory. Sometimes the error also crashes the head node.
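Since the error is raised when Python fails to start a thread, it may be worth checking the operating system's own limits on thread creation as well. The sketch below is only a diagnostic and does not use any Ray API. It assumes a Linux host, where RLIMIT_NPROC caps the combined number of processes and threads for a user; whether that limit is what is actually being hit here is an assumption, not something the traceback confirms. The helper names thread_limit_report and describe are made up for illustration.

```python
import resource
import threading


def thread_limit_report():
    """Collect limits that commonly bound thread creation (assumes Linux)."""
    # RLIMIT_NPROC: max processes/threads for the current user (soft, hard).
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    # RLIMIT_STACK: the soft value is the default stack size for new threads.
    soft_stack, _hard_stack = resource.getrlimit(resource.RLIMIT_STACK)
    return {
        "nproc_soft": soft_nproc,
        "nproc_hard": hard_nproc,
        "stack_soft_bytes": soft_stack,
        # Only counts Python threads in this process, not the whole user.
        "python_threads_in_this_process": threading.active_count(),
    }


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to a readable string."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, value in thread_limit_report().items():
        print(f"{name}: {describe(value)}")
```

If nproc_soft turns out to be small compared with the number of workers and gRPC threads started on the machine, raising it (for example with ulimit -u in the shell that launches Ray) or submitting fewer tasks/actors at once would be the first things to try.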
Here are my questions:
- Why does this issue occur?
- How can I avoid it? Are there any related configs in the documentation that I have missed?
Any ideas? Thanks to the community. I am also wondering whether I could open a PR against the documentation to address the confusion that new users may run into with issues like this.