Big cluster job failing due to SIGBUS in plasma

Hello, I am submitting a big job to my Ray cluster (43 nodes, 1952 CPUs). The job always runs for a few minutes and then workers start failing with the following exceptions and stack dumps.

The raylet is reporting a SIGBUS in plasma during a memory allocation. I can reproduce it 100% of the time.

What is the best way to get help chasing this down? The project is too complex to replicate elsewhere, but I am happy to provide useful debug information and to help test fixes.

Thanks!

2021-06-18 14:27:21,029	WARNING worker.py:1114 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 712, in ray._raylet.restore_spilled_objects_handler
  File "python/ray/_raylet.pyx", line 715, in ray._raylet.restore_spilled_objects_handler
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 535, in restore_spilled_objects
    url_with_offset_list)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 307, in restore_spilled_objects
    owner_address)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 102, in _put_object_to_store
    object_ref, owner_address)
  File "python/ray/_raylet.pyx", line 1081, in ray._raylet.CoreWorker.put_file_like_object
OSError: [Errno 14] Bad address
An unexpected internal error occurred while the IO worker was restoring spilled objects.
(autoscaler +1m21s) Adding 1 nodes of type local.cluster.node.
[2021-06-18 14:27:22,088 INFO   ] Pipeline ETR50 step 1 = ExtraTreesRegressor {'min_samples_leaf': 50, 'n_estimators': 1000, 'n_jobs': 24}
(raylet) [2021-06-18 14:27:22,856 E 212 228] logging.cc:441: *** Aborted at 1624051642 (unix time) try "date -d @1624051642" if you are using GNU date ***
(raylet) [2021-06-18 14:27:22,857 E 212 228] logging.cc:441: PC: @                0x0 (unknown)
(raylet) [2021-06-18 14:27:22,858 E 212 228] logging.cc:441: *** SIGBUS (@0x7fba7ff38108) received by PID 212 (TID 0x7fbaadffb700) from PID 2146664712; stack trace: ***
(raylet) [2021-06-18 14:27:22,858 E 212 228] logging.cc:441:     @     0x55bfbbbf8c4f google::(anonymous namespace)::FailureSignalHandler()
(raylet) [2021-06-18 14:27:22,858 E 212 228] logging.cc:441:     @     0x7fbabf1a33c0 (unknown)
(raylet) [2021-06-18 14:27:22,859 E 212 228] logging.cc:441:     @     0x55bfbb8f8bf2 dlmalloc
(raylet) [2021-06-18 14:27:22,859 E 212 228] logging.cc:441:     @     0x55bfbb8f9af0 plasma::internal_memalign()
(raylet) [2021-06-18 14:27:22,860 E 212 228] logging.cc:441:     @     0x55bfbb8e0b41 plasma::PlasmaAllocator::Memalign()
(raylet) [2021-06-18 14:27:22,860 E 212 228] logging.cc:441:     @     0x55bfbb8ed76e plasma::PlasmaStore::AllocateMemory()
(raylet) [2021-06-18 14:27:22,861 E 212 228] logging.cc:441:     @     0x55bfbb8edd0b plasma::PlasmaStore::CreateObject()
(raylet) [2021-06-18 14:27:22,861 E 212 228] logging.cc:441:     @     0x55bfbb8ee16d plasma::PlasmaStore::HandleCreateObjectRequest()
(raylet) [2021-06-18 14:27:22,862 E 212 228] logging.cc:441:     @     0x55bfbb8f2579 plasma::CreateRequestQueue::ProcessRequest()
(raylet) [2021-06-18 14:27:22,863 E 212 228] logging.cc:441:     @     0x55bfbb8f3ee6 plasma::CreateRequestQueue::ProcessRequests()
(raylet) [2021-06-18 14:27:22,863 E 212 228] logging.cc:441:     @     0x55bfbb8e7626 plasma::PlasmaStore::ProcessCreateRequests()
(raylet) [2021-06-18 14:27:22,864 E 212 228] logging.cc:441:     @     0x55bfbb8f017a plasma::PlasmaStore::ProcessMessage()
(raylet) [2021-06-18 14:27:22,864 E 212 228] logging.cc:441:     @     0x55bfbb8e21ff std::_Function_handler<>::_M_invoke()
(raylet) [2021-06-18 14:27:22,865 E 212 228] logging.cc:441:     @     0x55bfbb901d86 _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZN6plasma6Client6CreateESt8functionIFNS1_6StatusES0_ISB_ENSA_7flatbuf11MessageTypeES8_EEON5boost4asio19basic_stream_socketINSK_7generic15stream_protocolENSK_8executorEEEEUlS3_lS8_E_E9_M_invokeERKSt9_Any_dataOS3_OlS8_
(raylet) [2021-06-18 14:27:22,866 E 212 228] logging.cc:441:     @     0x55bfbbbbb672 ray::ClientConnection::ProcessMessage()
(raylet) [2021-06-18 14:27:22,866 E 212 228] logging.cc:441:     @     0x55bfbbbb81a8 boost::asio::detail::read_op<>::operator()()
(raylet) [2021-06-18 14:27:22,867 E 212 228] logging.cc:441:     @     0x55bfbbbb855b boost::asio::detail::executor_function<>::do_complete()
(raylet) [2021-06-18 14:27:22,868 E 212 228] logging.cc:441:     @     0x55bfbb78d5a0 boost::asio::io_context::executor_type::dispatch<>()
(raylet) [2021-06-18 14:27:22,869 E 212 228] logging.cc:441:     @     0x55bfbbbb8f53 boost::asio::executor::dispatch<>()
(raylet) [2021-06-18 14:27:22,870 E 212 228] logging.cc:441:     @     0x55bfbbbb9148 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) [2021-06-18 14:27:22,871 E 212 228] logging.cc:441:     @     0x55bfbbf6e011 boost::asio::detail::scheduler::do_run_one()
(raylet) [2021-06-18 14:27:22,873 E 212 228] logging.cc:441:     @     0x55bfbbf6e141 boost::asio::detail::scheduler::run()
(raylet) [2021-06-18 14:27:22,873 E 212 228] logging.cc:441:     @     0x55bfbbf70340 boost::asio::io_context::run()
(raylet) [2021-06-18 14:27:22,874 E 212 228] logging.cc:441:     @     0x55bfbb8e13fd plasma::PlasmaStoreRunner::Start()
(raylet) [2021-06-18 14:27:22,874 E 212 228] logging.cc:441:     @     0x55bfbb88e697 std::thread::_State_impl<>::_M_run()
(raylet) [2021-06-18 14:27:22,875 E 212 228] logging.cc:441:     @     0x55bfbbfab020 execute_native_thread_routine
(raylet) [2021-06-18 14:27:22,875 E 212 228] logging.cc:441:     @     0x7fbabf197609 start_thread
(raylet) [2021-06-18 14:27:22,875 E 212 228] logging.cc:441:     @     0x7fbabed68293 clone
(autoscaler +1m28s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m36s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m44s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m52s) Adding 1 nodes of type local.cluster.node.
2021-06-18 14:27:52,116	WARNING worker.py:1114 -- The node with node id: 58da865a3ddbee6443f5f68c504a1a0a39a14df87d1a149ba7391497 and ip: 10.0.1.34 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(pid=142, ip=10.0.1.28) 2021-06-18 14:27:52,205	ERROR worker.py:409 -- SystemExit was raised from the worker
(pid=142, ip=10.0.1.28) Traceback (most recent call last):
(pid=142, ip=10.0.1.28)   File "python/ray/_raylet.pyx", line 595, in ray._raylet.task_execution_handler
(pid=142, ip=10.0.1.28)   File "python/ray/_raylet.pyx", line 453, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28)   File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28)   File "python/ray/_raylet.pyx", line 497, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28)   File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
(pid=142, ip=10.0.1.28)     return function(*args, **kwargs)
(pid=142, ip=10.0.1.28)   File "/ceph/home/djakubiec/elcano/submissions/dan6/djlearn.py", line 411, in processGroupTrain
(pid=142, ip=10.0.1.28)     model.pipeline.fit(groupResult['modelData'].train.X, groupResult['modelData'].train.yTrue)
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py", line 346, in fit
(pid=142, ip=10.0.1.28)     self._final_estimator.fit(Xt, y, **fit_params_last_step)
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 393, in fit
(pid=142, ip=10.0.1.28)     for i, t in enumerate(trees))
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1054, in __call__
(pid=142, ip=10.0.1.28)     self.retrieve()
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 933, in retrieve
(pid=142, ip=10.0.1.28)     self._output.extend(job.get(timeout=self.timeout))
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
(pid=142, ip=10.0.1.28)     self.wait(timeout)
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
(pid=142, ip=10.0.1.28)     self._event.wait(timeout)
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/threading.py", line 552, in wait
(pid=142, ip=10.0.1.28)     signaled = self._cond.wait(timeout)
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/threading.py", line 296, in wait
(pid=142, ip=10.0.1.28)     waiter.acquire()
(pid=142, ip=10.0.1.28)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 406, in sigterm_handler
(pid=142, ip=10.0.1.28)     sys.exit(1)
(pid=142, ip=10.0.1.28) SystemExit: 1
2021-06-18 14:27:52,270	WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: e11fe2800445c79affffffffffffffffffffffff01000000 Worker ID: 34df4752ff905287f07553d1f4d3ec1cbd1dfe5fba57f51efd665533 Node ID: 8d7cff80b5bc5b838a93c495e569845209e6818740a66977172e223d Worker IP address: 10.0.1.28 Worker port: 10002 Worker PID: 142
(autoscaler +2m0s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m8s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m16s) Adding 1 nodes of type local.cluster.node.

This is a known issue, and we are actively working on fixing it right now! What version of Ray are you using? If you use master, it is probably already resolved.

Oh, great news, thanks! Sorry if there was a bug report out there that I didn’t find.

I am using Ray 1.4.0. I’ll figure out how to switch to master this weekend and let you know how that goes.

Thank you!

Sounds good! This happened due to a pretty complicated issue, so it’ll be great if you can verify that our fix resolves your problem :)!

Awesome! BTW @djakubiec you can install master using the latest wheels here: Installing Ray — Ray v2.0.0.dev0

Let us know if it works!

OK, well great news: it looks like the latest code (as of this morning) does in fact fix the SIGBUS failure!

Cluster tasks used to start failing within 5 minutes of job startup. With the latest code, the whole (rather complicated) job ran to completion in about 60 minutes.

Many thanks, I think this gets us moving forward again.

I am going to permanently migrate our cluster to the 2.0 beta. We have a variety of workloads to run, so I can report back to you on general stability.

One question… when using the nightly wheels, how do I install the equivalent of:

pip install 'ray[default]'

I ran the following command but did not get the latest ray command version or coordinator code:

pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

Thanks again!

@djakubiec that’s great news!

OK, that would be great. Keep in mind we basically use “2.0” as our “dev” branch, and we will probably keep releasing 1.X for a while.

You can just do pip install -U "ray[default] @ https://...(wheel link)".
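
For example, with the wheel URL from your earlier command, that would be:

pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"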

I hope that works for you! Can I ask what you’re doing with Ray?

We have several different machine learning projects built on sklearn and pandas, and we are looking to better leverage our cluster for these Python-based projects. ~2000 CPU cores spread across ~50 machines and ~40 GPU devices.

Well, after our initial success we’re now running into a different problem.

We ramped up the script workload a bit and now it hangs partway through, but I don’t see any obvious errors. Perhaps you could suggest where to look?

  1. We submit 204 tasks via myfunc.remote() and need to process the results sequentially.
  2. On the first attempt it got the first 42 results via ray.get(), but the 43rd one hung.
  3. We killed the job and restarted the cluster and tried again.
  4. On the second attempt of the same job it got the first 46 results via ray.get(), but the 47th one hung.
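
For reference, the driver loop is roughly this shape (a simplified sketch; myfunc, groups, and handleResult stand in for our actual pipeline code):

refs = [myfunc.remote(group) for group in groups]   # 204 submissions
for taskNumber, ref in enumerate(refs, 1):
    result = ray.get(ref)    # hung waiting for result 43 on the first run, 47 on the second
    handleResult(result)     # results must be consumed sequentially, in submission order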

All nodes appear healthy but now idle with no demands:

======== Autoscaler status: 2021-06-21 07:24:49.593199 ========
Node status
---------------------------------------------------------------
Healthy:
 43 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/1952.0 CPU
 0.0/34.0 GPU
 0.0/3.0 accelerator_type:T
 0.00/2878.554 GiB memory
 19.95/1237.657 GiB object_store_memory

Demands:
 (no resource demands)
2021-06-21 07:24:50,111	INFO monitor.py:224 -- :event_summary:Adding 1 nodes of type local.cluster.node.
2021-06-21 07:24:56,273	INFO autoscaler.py:761 -- StandardAutoscaler: Queue 1 new nodes for launch
2021-06-21 07:24:56,274	INFO node_launcher.py:96 -- NodeLauncher1: Got 1 nodes to launch.
2021-06-21 07:24:56,280	INFO node_launcher.py:96 -- NodeLauncher1: Launching 1 nodes, type local.cluster.node.
2021-06-21 07:24:56,287	INFO node_launcher.py:96 -- NodeLauncher1: No new nodes reported after node creation.
2021-06-21 07:24:57,219	INFO autoscaler.py:354 --

ray memory shows thousands of objects like these:

...
10.0.1.34 | 3207 | Driver | (task call)  | /ceph/home/djakubiec/elcano/submissions/dan6/djlearn.py:raySubmissionThread:311 | /home/ray/anaconda3/lib/python3.7/threading.py:run:870 | /home/ray/anaconda3/lib/python3.7/threading.py:_bootstrap_inner:926 | ?    | LOCAL_REFERENCE | 1301cfed27341981ffffffffffffffffffffffff0100000001000000


10.0.1.34 | 3207 | Driver | (task call)  | /ceph/home/djakubiec/elcano/submissions/dan6/djlearn.py:raySubmissionThread:311 | /home/ray/anaconda3/lib/python3.7/threading.py:run:870 | /home/ray/anaconda3/lib/python3.7/threading.py:_bootstrap_inner:926 | ?    | LOCAL_REFERENCE | 222aaa12d81d57c1ffffffffffffffffffffffff0100000001000000


10.0.1.34 | 3207 | Driver | (task call)  | /ceph/home/djakubiec/elcano/submissions/dan6/djlearn.py:raySubmissionThread:311 | /home/ray/anaconda3/lib/python3.7/threading.py:run:870 | /home/ray/anaconda3/lib/python3.7/threading.py:_bootstrap_inner:926 | ?    | LOCAL_REFERENCE | 008353944136b198ffffffffffffffffffffffff0100000001000000


10.0.1.34 | 3207 | Driver | (task call)  | /ceph/home/djakubiec/elcano/submissions/dan6/djlearn.py:raySubmissionThread:311 | /home/ray/anaconda3/lib/python3.7/threading.py:run:870 | /home/ray/anaconda3/lib/python3.7/threading.py:_bootstrap_inner:926 | ?    | LOCAL_REFERENCE | ca8e73d256741cd1ffffffffffffffffffffffff0100000001000000
...

ray timeline looked normal, as far as I can tell, until it stopped.

Otherwise I don’t know where else to look. Any suggestions?

And ooh… after I posted that last message I looked at the output from the stuck job (it was still running but not doing anything), and it popped out a few dozen exceptions like this from various worker nodes.

SIGABRT this time. It’s the first time I’ve seen this error, so I’m not sure whether it is related to why the job was originally hanging or just some other coincidental error:

(autoscaler +1h52m26s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1h52m41s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1h52m55s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1h53m10s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1h53m25s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1h53m41s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1h53m56s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1h54m12s) Adding 1 nodes of type local.cluster.node.
(raylet) [2021-06-21 07:46:17,225 C 212 212] node_manager.cc:165: This node has beem marked as dead.
2021-06-21 07:46:24,475	WARNING worker.py:1121 -- The node with node id: 8ba3c98941ac842f38fe4f8e5f4c476461c0ecbde8584d433ed78f13 and ip: 10.0.1.34 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +2h6s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2h6s) Restarting 42 nodes of type local.cluster.node (lost contact with raylet).
(autoscaler +2h16s) Adding 1 nodes of type local.cluster.node.
(raylet, ip=10.0.1.32) [2021-06-21 07:46:40,357 E 83 134] logging.cc:440: *** Aborted at 1624286800 (unix time) try "date -d @1624286800" if you are using GNU date ***
(raylet, ip=10.0.1.32) [2021-06-21 07:46:40,358 E 83 134] logging.cc:440: PC: @                0x0 (unknown)
(raylet, ip=10.0.1.30) terminate called after throwing an instance of 'std::bad_function_call'
(raylet, ip=10.0.1.30)   what():  bad_function_call
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,908 E 82 133] logging.cc:440: *** Aborted at 1624286800 (unix time) try "date -d @1624286800" if you are using GNU date ***
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,908 E 82 133] logging.cc:440: PC: @                0x0 (unknown)
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,908 E 82 133] logging.cc:440: *** SIGABRT (@0x3e800000052) received by PID 82 (TID 0x7fb00cff9700) from PID 82; stack trace: ***
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,909 E 82 133] logging.cc:440:     @     0x55b8314614cf google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,909 E 82 133] logging.cc:440:     @     0x7fb7a3daa3c0 (unknown)
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,909 E 82 133] logging.cc:440:     @     0x7fb7a389318b gsignal
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,910 E 82 133] logging.cc:440:     @     0x7fb7a3872859 abort
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,910 E 82 133] logging.cc:440:     @     0x7fb7a3af8951 (unknown)
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,910 E 82 133] logging.cc:440:     @     0x7fb7a3b0447c (unknown)
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,911 E 82 133] logging.cc:440:     @     0x7fb7a3b044e7 std::terminate()
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,911 E 82 133] logging.cc:440:     @     0x7fb7a3b04799 __cxa_throw
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,911 E 82 133] logging.cc:440:     @     0x7fb7a3afb590 std::__throw_bad_function_call()
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,911 E 82 133] logging.cc:440:     @     0x55b83102382d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3ray6raylet12AgentManager10StartAgentEvEUlvE0_EEEEE6_M_runEv
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,912 E 82 133] logging.cc:440:     @     0x55b831814020 execute_native_thread_routine
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,912 E 82 133] logging.cc:440:     @     0x7fb7a3d9e609 start_thread
(raylet, ip=10.0.1.30) [2021-06-21 07:46:40,912 E 82 133] logging.cc:440:     @     0x7fb7a396f293 clone
(raylet, ip=10.0.1.40) [2021-06-21 07:46:40,948 E 82 133] logging.cc:440: *** Aborted at 1624286800 (unix time) try "date -d @1624286800" if you are using GNU date ***
(raylet, ip=10.0.1.40) [2021-06-21 07:46:40,949 E 82 133] logging.cc:440: PC: @                0x0 (unknown)
(raylet, ip=10.0.1.33) [2021-06-21 07:46:41,054 E 82 124] logging.cc:440: *** Aborted at 1624286801 (unix time) try "date -d @1624286801" if you are using GNU date ***
(raylet, ip=10.0.1.33) [2021-06-21 07:46:41,054 E 82 124] logging.cc:440: PC: @                0x0 (unknown)
(raylet, ip=10.0.1.104) terminate called after throwing an instance of 'std::bad_function_call'
(raylet, ip=10.0.1.104)   what():  bad_function_call
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,044 E 82 133] logging.cc:440: *** Aborted at 1624286803 (unix time) try "date -d @1624286803" if you are using GNU date ***
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,044 E 82 133] logging.cc:440: PC: @                0x0 (unknown)
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,044 E 82 133] logging.cc:440: *** SIGABRT (@0x3e800000052) received by PID 82 (TID 0x7f624affd700) from PID 82; stack trace: ***
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,044 E 82 133] logging.cc:440:     @     0x5587055884cf google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,045 E 82 133] logging.cc:440:     @     0x7f6a0a0143c0 (unknown)
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,045 E 82 133] logging.cc:440:     @     0x7f6a09afd18b gsignal
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,045 E 82 133] logging.cc:440:     @     0x7f6a09adc859 abort
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,046 E 82 133] logging.cc:440:     @     0x7f6a09d62951 (unknown)
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,046 E 82 133] logging.cc:440:     @     0x7f6a09d6e47c (unknown)
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,046 E 82 133] logging.cc:440:     @     0x7f6a09d6e4e7 std::terminate()
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,047 E 82 133] logging.cc:440:     @     0x7f6a09d6e799 __cxa_throw
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,047 E 82 133] logging.cc:440:     @     0x7f6a09d65590 std::__throw_bad_function_call()
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,047 E 82 133] logging.cc:440:     @     0x55870514a82d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3ray6raylet12AgentManager10StartAgentEvEUlvE0_EEEEE6_M_runEv
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,047 E 82 133] logging.cc:440:     @     0x55870593b020 execute_native_thread_routine
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,048 E 82 133] logging.cc:440:     @     0x7f6a0a008609 start_thread
(raylet, ip=10.0.1.104) [2021-06-21 07:46:43,048 E 82 133] logging.cc:440:     @     0x7f6a09bd9293 clone

Hmm… there are probably other unknown issues, since removing the SIGBUS required some protocol changes on our side… Is it possible for you to provide a self-contained repro? Otherwise, you might want to wait until [WIP] Fix PullManager handling of get requests and liveness issues by ericl · Pull Request #16394 · ray-project/ray · GitHub is merged and tested (and we can follow up again).

Sorry, a self-contained repro is hard as the job is rather complex.

I will say that since I last posted, I did cure one case of replicating a huge data structure 204 times, once per task, by condensing it down to a single shared object_ref. This avoids the problem for me now, though obviously Ray should have done something more graceful than a SIGABRT.
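
In case it helps, the change was roughly this (a sketch with placeholder names, not our actual code):

# Before: the huge structure was implicitly copied into each of the 204 task submissions.
# refs = [processGroupTrain.remote(hugeData, group) for group in groups]

# After: put it into the object store once and pass the shared reference;
# Ray resolves the ObjectRef to the same shared copy inside each task.
hugeDataRef = ray.put(hugeData)
refs = [processGroupTrain.remote(hugeDataRef, group) for group in groups]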

But thanks for the bug pointer, I will keep an eye out for that.

Thank you for all the help!

@sangcho So actually I think I do have a good repro for the hang, and I think I see what is causing it.

Basically, we are submitting a bunch of actors that require 24 CPUs each. Most of our host nodes have either 24 or 72 CPUs, but some have CPU counts that are not multiples of 24, e.g. a few 32-, 40-, 88-, and 128-CPU nodes.

Once I submit enough remote() calls to fill up all the available 24-CPU slots in the cluster, the next submission generates this warning, and at that point the cluster stops scheduling the remaining tasks (even once the original ones complete):

WARNING worker.py:1121 -- The actor or task with ID ffffffffffffffff34826751e2ba3336b3a86ec005000000 cannot be scheduled right now. It requires {CPU: 24.000000} for placement, but this node only has remaining {16.000000/40.000000 CPU, 67.158149 GiB/67.158149 GiB memory, 28.782064 GiB/28.782064 GiB object_store_memory, 1.000000/1.000000 node:10.0.1.40}
. In total there are 0 pending tasks and 127 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.

Once I see that warning I know it will hang forever.

Perhaps it is specific to the case where the cluster still has CPUs available overall, but no single worker has enough free CPUs to place the requested task.

Here is the code to reproduce it:

import logging as log
logFormat=f'[%(asctime)s %(levelname)-7s] %(message)s'
log.basicConfig(format=logFormat, level=log.INFO)

import time
import ray

ray.init(address='auto')
log.info('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

# Each actor reserves 24 CPUs for its entire lifetime.
@ray.remote(num_cpus=24)
class Actor:
    def run(self):
        # Simulate a 20-second unit of work.
        time.sleep(20)
        return None

log.info(f"Instantiating actors.")
# 204 actors x 24 CPUs each = 4896 CPUs requested, more than the 1952 available.
actors = [Actor.remote() for _ in range(204)]

log.info(f"Submitting tasks.")
tasks = [actor.run.remote() for actor in actors]

log.info(f"Waiting for {len(tasks)} tasks.")
for taskNumber, task in enumerate(tasks, 1):
    log.info(f"Waiting for task {taskNumber}.")
    ray.get(task)

log.info(f"Complete.")
raise SystemExit()

And here are my corresponding outputs:

⇒  ray submit cluster.yaml rayTest.py

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 10.0.1.34
Shared connection to 10.0.1.34 closed.
Shared connection to 10.0.1.34 closed.
Fetched IP: 10.0.1.34
Shared connection to 10.0.1.34 closed.
2021-06-23 07:28:16,736	INFO worker.py:734 -- Connecting to existing Ray cluster at address: 10.0.1.34:6379
[2021-06-23 07:28:16,904 INFO   ] This cluster consists of
    43 nodes in total
    1952.0 CPU resources in total

[2021-06-23 07:28:16,905 INFO   ] Instantiating actors.
[2021-06-23 07:28:17,083 INFO   ] Submitting tasks.
[2021-06-23 07:28:17,092 INFO   ] Waiting for 204 tasks.
[2021-06-23 07:28:17,093 INFO   ] Waiting for task 1.
(autoscaler +8s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +8s) Adding 1 nodes of type local.cluster.node.
(autoscaler +16s) Adding 1 nodes of type local.cluster.node.
(autoscaler +23s) Adding 1 nodes of type local.cluster.node.
[2021-06-23 07:28:43,950 INFO   ] Waiting for task 2.
[2021-06-23 07:28:44,081 INFO   ] Waiting for task 3.
[2021-06-23 07:28:44,131 INFO   ] Waiting for task 4.
[2021-06-23 07:28:44,181 INFO   ] Waiting for task 5.
[2021-06-23 07:28:44,241 INFO   ] Waiting for task 6.
2021-06-23 07:28:45,343	WARNING worker.py:1121 -- The actor or task with ID ffffffffffffffff34826751e2ba3336b3a86ec005000000 cannot be scheduled right now. It requires {CPU: 24.000000} for placement, but this node only has remaining {16.000000/40.000000 CPU, 67.158149 GiB/67.158149 GiB memory, 28.782064 GiB/28.782064 GiB object_store_memory, 1.000000/1.000000 node:10.0.1.40}
. In total there are 0 pending tasks and 127 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
[2021-06-23 07:28:46,241 INFO   ] Waiting for task 7.
[2021-06-23 07:28:46,259 INFO   ] Waiting for task 8.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 9.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 10.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 11.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 12.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 13.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 14.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 15.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 16.
[2021-06-23 07:28:46,332 INFO   ] Waiting for task 17.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 18.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 19.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 20.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 21.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 22.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 23.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 24.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 25.
[2021-06-23 07:28:46,333 INFO   ] Waiting for task 26.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 27.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 28.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 29.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 30.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 31.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 32.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 33.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 34.
[2021-06-23 07:28:46,334 INFO   ] Waiting for task 35.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 36.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 37.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 38.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 39.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 40.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 41.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 42.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 43.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 44.
[2021-06-23 07:28:46,335 INFO   ] Waiting for task 45.
[2021-06-23 07:28:46,336 INFO   ] Waiting for task 46.
[2021-06-23 07:28:46,336 INFO   ] Waiting for task 47.
[2021-06-23 07:28:46,336 INFO   ] Waiting for task 48.
[2021-06-23 07:28:46,336 INFO   ] Waiting for task 49.
[2021-06-23 07:28:46,336 INFO   ] Waiting for task 50.
[2021-06-23 07:28:46,336 INFO   ] Waiting for task 51.
[2021-06-23 07:28:46,336 INFO   ] Waiting for task 52.
(autoscaler +31s) Adding 1 nodes of type local.cluster.node.
[2021-06-23 07:28:48,985 INFO   ] Waiting for task 53.
[2021-06-23 07:28:49,035 INFO   ] Waiting for task 54.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 55.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 56.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 57.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 58.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 59.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 60.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 61.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 62.
[2021-06-23 07:28:49,093 INFO   ] Waiting for task 63.
[2021-06-23 07:28:49,094 INFO   ] Waiting for task 64.
[2021-06-23 07:28:49,094 INFO   ] Waiting for task 65.
[2021-06-23 07:28:49,094 INFO   ] Waiting for task 66.
[2021-06-23 07:28:49,160 INFO   ] Waiting for task 67.
[2021-06-23 07:28:49,160 INFO   ] Waiting for task 68.
[2021-06-23 07:28:49,160 INFO   ] Waiting for task 69.
[2021-06-23 07:28:49,160 INFO   ] Waiting for task 70.
[2021-06-23 07:28:49,160 INFO   ] Waiting for task 71.
[2021-06-23 07:28:49,161 INFO   ] Waiting for task 72.
[2021-06-23 07:28:49,161 INFO   ] Waiting for task 73.
[2021-06-23 07:28:49,161 INFO   ] Waiting for task 74.
[2021-06-23 07:28:49,161 INFO   ] Waiting for task 75.
[2021-06-23 07:28:49,161 INFO   ] Waiting for task 76.
[2021-06-23 07:28:49,161 INFO   ] Waiting for task 77.
[2021-06-23 07:28:49,200 INFO   ] Waiting for task 78.
(autoscaler +39s) Adding 1 nodes of type local.cluster.node.
(autoscaler +47s) Adding 1 nodes of type local.cluster.node.
(autoscaler +55s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m3s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m11s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m19s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m27s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m35s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m43s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m51s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m59s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m7s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m15s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m23s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m31s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m39s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m47s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m55s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m3s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m11s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m19s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m27s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m35s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m43s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m51s) Adding 1 nodes of type local.cluster.node.
(autoscaler +3m59s) Adding 1 nodes of type local.cluster.node.
(autoscaler +4m7s) Adding 1 nodes of type local.cluster.node.
(autoscaler +4m15s) Adding 1 nodes of type local.cluster.node.
(autoscaler +4m23s) Adding 1 nodes of type local.cluster.node.
(autoscaler +4m31s) Adding 1 nodes of type local.cluster.node.
(autoscaler +4m39s) Adding 1 nodes of type local.cluster.node.
(autoscaler +4m47s) Adding 1 nodes of type local.cluster.node.
(autoscaler +4m55s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m2s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m10s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m18s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m26s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m34s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m42s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m50s) Adding 1 nodes of type local.cluster.node.
(autoscaler +5m58s) Adding 1 nodes of type local.cluster.node.
(autoscaler +6m6s) Adding 1 nodes of type local.cluster.node.
(autoscaler +6m14s) Adding 1 nodes of type local.cluster.node.
(autoscaler +6m22s) Adding 1 nodes of type local.cluster.node.
(autoscaler +6m30s) Adding 1 nodes of type local.cluster.node.
(autoscaler +6m37s) Adding 1 nodes of type local.cluster.node.
(autoscaler +6m45s) Adding 1 nodes of type local.cluster.node.
(autoscaler +6m53s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m1s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m9s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m17s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m24s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m32s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m40s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m48s) Adding 1 nodes of type local.cluster.node.
(autoscaler +7m56s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m4s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m12s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m20s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m27s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m35s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m43s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m51s) Adding 1 nodes of type local.cluster.node.
(autoscaler +8m59s) Adding 1 nodes of type local.cluster.node.
(autoscaler +9m7s) Adding 1 nodes of type local.cluster.node.
(autoscaler +9m15s) Adding 1 nodes of type local.cluster.node.
(autoscaler +9m22s) Adding 1 nodes of type local.cluster.node.
(autoscaler +9m30s) Adding 1 nodes of type local.cluster.node.
(autoscaler +9m38s) Adding 1 nodes of type local.cluster.node.
(autoscaler +9m46s) Adding 1 nodes of type local.cluster.node.
(autoscaler +9m54s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m2s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m9s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m17s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m25s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m33s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m41s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m49s) Adding 1 nodes of type local.cluster.node.
(autoscaler +10m57s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m5s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m12s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m20s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m28s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m36s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m44s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m52s) Adding 1 nodes of type local.cluster.node.
(autoscaler +11m59s) Adding 1 nodes of type local.cluster.node.
(autoscaler +12m7s) Adding 1 nodes of type local.cluster.node.
(autoscaler +12m15s) Adding 1 nodes of type local.cluster.node.
(autoscaler +12m23s) Adding 1 nodes of type local.cluster.node.
(autoscaler +12m31s) Adding 1 nodes of type local.cluster.node.
(autoscaler +12m39s) Adding 1 nodes of type local.cluster.node.
(autoscaler +12m46s) Adding 1 nodes of type local.cluster.node.
(autoscaler +12m54s) Adding 1 nodes of type local.cluster.node.
(autoscaler +13m2s) Adding 1 nodes of type local.cluster.node.
...

And here is the autoscaler status output many minutes after all processing stopped:

======== Autoscaler status: 2021-06-23 07:51:36.161905 ========
Node status
---------------------------------------------------------------
Healthy:
 43 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 1848.0/1952.0 CPU
 0.0/34.0 GPU
 0.0/3.0 accelerator_type:T
 0.00/2878.548 GiB memory
 0.00/1237.655 GiB object_store_memory

Demands:
 {'CPU': 24.0}: 127+ pending tasks/actors
2021-06-23 07:51:36,721	INFO monitor.py:224 -- :event_summary:Adding 1 nodes of type local.cluster.node.
2021-06-23 07:51:42,940	INFO autoscaler.py:761 -- StandardAutoscaler: Queue 1 new nodes for launch
2021-06-23 07:51:42,941	INFO node_launcher.py:96 -- NodeLauncher1: Got 1 nodes to launch.
2021-06-23 07:51:42,947	INFO node_launcher.py:96 -- NodeLauncher1: Launching 1 nodes, type local.cluster.node.
2021-06-23 07:51:42,956	INFO node_launcher.py:96 -- NodeLauncher1: No new nodes reported after node creation.
2021-06-23 07:51:43,918	INFO autoscaler.py:354 --
======== Autoscaler status: 2021-06-23 07:51:43.918594 ========
Node status
---------------------------------------------------------------
Healthy:
 43 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

I am guessing you could reproduce this with a smaller test cluster by choosing an irregular CPU topology, e.g. 2 CPUs per task, one worker with 2 CPUs, and one worker with 3 CPUs.
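
For example, something along these lines should set up that smaller topology locally (a sketch only; it assumes the ray.cluster_utils.Cluster test helper is available in your build, and I have not run it myself):

import time
import ray
from ray.cluster_utils import Cluster

# Two local "nodes" with an irregular CPU split: 2 + 3 = 5 CPUs total.
cluster = Cluster(initialize_head=True, head_node_args={"num_cpus": 2})
cluster.add_node(num_cpus=3)
ray.init(address=cluster.address)

@ray.remote(num_cpus=2)
class Actor:
    def run(self):
        time.sleep(20)
        return None

# Only two 2-CPU actors can be placed (one per node). The third needs 2 CPUs on a
# single node, but only 1 CPU remains free anywhere, so it should stay pending and
# the ray.get() below should hang the same way.
actors = [Actor.remote() for _ in range(3)]
results = ray.get([actor.run.remote() for actor in actors])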

To follow up here, I think we figured out the issues. A few different things were going on:

  1. In the repro above, the actual issue is that the actors never get released as the results arrive, so they continue to hold onto their CPUs. That makes it an invalid test case, sorry. But it is a little different from what is happening in our application…

  2. In our application we do release the actor handles as the responses arrive… but we are trying to assemble the results in order, so we are using ray.get() instead of ray.wait(). This can lead to a subtle deadlock situation where the “next” sequential result ends up being queued behind completed results that are still occupying CPUs while waiting for their actor handles to be freed.

So I think the solution in our case is to ditch the class-based actors, switch to function-based Ray tasks with ray.wait(), and then reassemble the results in order on the head node ourselves.
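
Roughly the pattern we plan to switch to (a sketch with placeholder names, not our actual code):

@ray.remote(num_cpus=24)
def processGroup(group):
    ...  # do the work for one group and return its result

refs = [processGroup.remote(group) for group in groups]
order = {ref: i for i, ref in enumerate(refs)}
results = [None] * len(refs)

pending = list(refs)
while pending:
    # Take whichever task finishes next instead of blocking on a specific one,
    # so a slow task never blocks retrieval of already-completed results.
    done, pending = ray.wait(pending, num_returns=1)
    ref = done[0]
    results[order[ref]] = ray.get(ref)

# results is now in submission order, assembled on the head node.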

Sorry for any misdirection there.

@sangcho @rliaw Quick question: the master commit that fixed this issue for me was f5f23448fcab7c896e478e9e5c9804a73688ec41.

Were the fixes you referenced included in the subsequent v1.4.1 release?

I’m running into other unrelated clustering issues now and would like to re-stabilize on a released version before reporting anything. Thanks!

1.4.1 includes only minimal, low-hanging-fruit changes, so this fix won’t be included (it requires some rearchitecting). I think the change will be included in 1.5.0. Do you mind sharing what the other cluster issues are and whether they have been addressed? (We’d love to take a look if not. The stability of Ray core is the #1 priority for us.)

@sangcho I posted the issue to a new thread Periodic _MultiThreadedRendezvous failure leaves cluster in damaged state and mentioned you there. Thanks!