System hangs when the number of tasks is large

I browsed the logs you posted.

  1. A lot of Ray worker processes are started and die shortly afterwards (see gcs_server.out below).
  2. Tasks fail to run because the workers died (see python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1864.log below).
  3. The workers die because of "Unhandled exception: St12system_error. what(): Resource temporarily unavailable" (see raylet.err below).

@jjyao Do you know what the possible causes of this exception might be?
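For what it's worth, the raylet.err traceback below ends in RuntimeError: can't start new thread, which is what Python raises when the OS refuses to create another thread (thread creation failing with EAGAIN, whose message is exactly "Resource temporarily unavailable"). That usually points at a per-user process/thread limit or pids cgroup limit being hit when many workers start at once. This is a guess, not a confirmed diagnosis; the sketch below (not Ray code, just an illustration) reproduces the same RuntimeError by exhausting the thread limit. Run it with care, ideally in an isolated container with a low pids limit:

```python
import threading
import time


def exhaust_thread_limit():
    """Start idle daemon threads until the OS refuses to create more.

    On Linux this ends with "RuntimeError: can't start new thread" once the
    per-user process/thread limit (ulimit -u / cgroup pids.max) or memory for
    thread stacks runs out -- the same error shown in the raylet.err traceback.
    WARNING: this deliberately exhausts a system limit; do not run it on a
    production node.
    """
    threads = []
    try:
        while True:
            t = threading.Thread(target=time.sleep, args=(3600,), daemon=True)
            t.start()  # raises RuntimeError when thread creation fails
            threads.append(t)
    except RuntimeError as exc:
        print(f"thread creation failed after {len(threads)} threads: {exc}")


if __name__ == "__main__":
    exhaust_thread_limit()
```

If that is what is happening here, the workers would be bumping into ulimit -u (or the container's pids limit), and raising the limit or reducing the number of concurrent workers should stop the crashes; again, only a guess until the limits on the affected node are checked.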

gcs_server.out

[2023-03-15 09:04:36,112 W 130953 130953] (gcs_server) gcs_worker_manager.cc:55: Reporting worker exit, worker id = 3c6c4ee9801437c0eb2ccaecc92c6a2361da93456329148268c104a2, node id = ffffffffffffffffffffffffffffffffffffffffffffffffffffffff, address = , exit_type = SYSTEM_ERROR, exit_detail = Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors… Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.

python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1864.log

[2023-03-15 09:04:36,208 I 1864 1927] raylet_client.cc:381: Error returning worker: Invalid: Returned worker does not exist any more

[2023-03-15 09:04:36,208 I 1864 1927] task_manager.cc:467: task 91581beb08e6c9deffffffffffffffffffffffff01000000 retries left: 3, oom retries left: -1, task failed due to oom: 0

[2023-03-15 09:04:36,208 I 1864 1927] task_manager.cc:471: Attempting to resubmit task 91581beb08e6c9deffffffffffffffffffffffff01000000 for attempt number: 0

[2023-03-15 09:04:36,209 I 1864 1927] core_worker.cc:350: Will resubmit task after a 0ms delay: Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=remote_object, class_name=, function_name=transform_rand_tensor, function_hash=dd13dc2a0abe4e4484330a28e4065226}, task_id=91581beb08e6c9deffffffffffffffffffffffff01000000, task_name=transform_rand_tensor, job_id=01000000, num_args=2, num_returns=1, depth=1, attempt_number=1, max_retries=3

raylet.err

[2023-03-15 09:04:35,991 E 2035 2035] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2049 2049] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2042 2042] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2018 2018] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2064 2064] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2033 2033] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2037 2037] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2040 2040] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 1986 1986] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2012 2012] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2010 2010] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2044 2044] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2023 2023] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2047 2047] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2050 2050] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
(The same traceback appears twice, interleaved from two worker processes; untangled below.)

Traceback (most recent call last):
  File "/home/binli/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 210, in <module>
    ray._private.worker.connect(
  File "/home/binli/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2111, in connect
    worker.import_thread.start()
  File "/home/binli/anaconda3/lib/python3.9/site-packages/ray/_private/import_thread.py", line 61, in start
    self.t.start()
  File "/home/binli/anaconda3/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
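In case it helps to rule the limit theory in or out, a small diagnostic sketch like the one below (Linux-only, names and paths are standard kernel/procfs interfaces, not Ray APIs) run on the affected node would show whether the thread/process limits are small compared to the number of Ray workers being launched:

```python
import os
import resource

# Limits most often implicated in "can't start new thread" on Linux.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC (max user processes/threads): soft =", soft, "hard =", hard)

with open("/proc/sys/kernel/threads-max") as f:
    print("kernel.threads-max:", f.read().strip())

# How many threads this process itself currently has.
print("threads in this process:", len(os.listdir(f"/proc/{os.getpid()}/task")))
```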