Hello, I am submitting a large job to my Ray cluster (43 nodes, 1952 CPUs). The job always runs for a few minutes before workers start failing with the exceptions and stack traces below.
The raylet reports a SIGBUS inside the plasma store during memory allocation. I can reproduce this 100% of the time.
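Since the crash is inside plasma's allocator, one thing I can report from the failing nodes is the shared-memory sizing backing the object store. Here is a minimal sketch of the check I would run (assuming the store is backed by the default /dev/shm mount; our nodes may be using a different --plasma-directory):

```python
import shutil

# Capacity and usage of the tmpfs backing the plasma object store.
# Assumes the default /dev/shm mount point on the worker node.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 1e9:.1f} GB")
print(f"/dev/shm used:  {used / 1e9:.1f} GB")
print(f"/dev/shm free:  {free / 1e9:.1f} GB")
```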
What is the best way to get help chasing this down? The project is too complex to replicate elsewhere, but I am happy to provide any useful debug information and to help test fixes.
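For example, I could attach the raylet logs from the crashing nodes (e.g. raylet.out / raylet.err under the session log directory, assuming the default /tmp/ray temp dir) plus a snapshot of cluster state. A rough sketch of what I would collect from the driver:

```python
import ray

# Connect to the already-running cluster from a node that is part of it.
ray.init(address="auto")

# Cluster-wide resource totals and per-node liveness, to correlate with
# the nodes that get marked dead after the raylet crash.
print(ray.cluster_resources())
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```

Happy to run anything else (ray memory output, core dumps, etc.) if that would help narrow it down.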
Thanks!
2021-06-18 14:27:21,029 WARNING worker.py:1114 -- Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 712, in ray._raylet.restore_spilled_objects_handler
File "python/ray/_raylet.pyx", line 715, in ray._raylet.restore_spilled_objects_handler
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 535, in restore_spilled_objects
url_with_offset_list)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 307, in restore_spilled_objects
owner_address)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 102, in _put_object_to_store
object_ref, owner_address)
File "python/ray/_raylet.pyx", line 1081, in ray._raylet.CoreWorker.put_file_like_object
OSError: [Errno 14] Bad address
An unexpected internal error occurred while the IO worker was restoring spilled objects.
(autoscaler +1m21s) Adding 1 nodes of type local.cluster.node.
[2021-06-18 14:27:22,088 INFO ] Pipeline ETR50 step 1 = ExtraTreesRegressor {'min_samples_leaf': 50, 'n_estimators': 1000, 'n_jobs': 24}
(raylet) [2021-06-18 14:27:22,856 E 212 228] logging.cc:441: *** Aborted at 1624051642 (unix time) try "date -d @1624051642" if you are using GNU date ***
(raylet) [2021-06-18 14:27:22,857 E 212 228] logging.cc:441: PC: @ 0x0 (unknown)
(raylet) [2021-06-18 14:27:22,858 E 212 228] logging.cc:441: *** SIGBUS (@0x7fba7ff38108) received by PID 212 (TID 0x7fbaadffb700) from PID 2146664712; stack trace: ***
(raylet) [2021-06-18 14:27:22,858 E 212 228] logging.cc:441: @ 0x55bfbbbf8c4f google::(anonymous namespace)::FailureSignalHandler()
(raylet) [2021-06-18 14:27:22,858 E 212 228] logging.cc:441: @ 0x7fbabf1a33c0 (unknown)
(raylet) [2021-06-18 14:27:22,859 E 212 228] logging.cc:441: @ 0x55bfbb8f8bf2 dlmalloc
(raylet) [2021-06-18 14:27:22,859 E 212 228] logging.cc:441: @ 0x55bfbb8f9af0 plasma::internal_memalign()
(raylet) [2021-06-18 14:27:22,860 E 212 228] logging.cc:441: @ 0x55bfbb8e0b41 plasma::PlasmaAllocator::Memalign()
(raylet) [2021-06-18 14:27:22,860 E 212 228] logging.cc:441: @ 0x55bfbb8ed76e plasma::PlasmaStore::AllocateMemory()
(raylet) [2021-06-18 14:27:22,861 E 212 228] logging.cc:441: @ 0x55bfbb8edd0b plasma::PlasmaStore::CreateObject()
(raylet) [2021-06-18 14:27:22,861 E 212 228] logging.cc:441: @ 0x55bfbb8ee16d plasma::PlasmaStore::HandleCreateObjectRequest()
(raylet) [2021-06-18 14:27:22,862 E 212 228] logging.cc:441: @ 0x55bfbb8f2579 plasma::CreateRequestQueue::ProcessRequest()
(raylet) [2021-06-18 14:27:22,863 E 212 228] logging.cc:441: @ 0x55bfbb8f3ee6 plasma::CreateRequestQueue::ProcessRequests()
(raylet) [2021-06-18 14:27:22,863 E 212 228] logging.cc:441: @ 0x55bfbb8e7626 plasma::PlasmaStore::ProcessCreateRequests()
(raylet) [2021-06-18 14:27:22,864 E 212 228] logging.cc:441: @ 0x55bfbb8f017a plasma::PlasmaStore::ProcessMessage()
(raylet) [2021-06-18 14:27:22,864 E 212 228] logging.cc:441: @ 0x55bfbb8e21ff std::_Function_handler<>::_M_invoke()
(raylet) [2021-06-18 14:27:22,865 E 212 228] logging.cc:441: @ 0x55bfbb901d86 _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZN6plasma6Client6CreateESt8functionIFNS1_6StatusES0_ISB_ENSA_7flatbuf11MessageTypeES8_EEON5boost4asio19basic_stream_socketINSK_7generic15stream_protocolENSK_8executorEEEEUlS3_lS8_E_E9_M_invokeERKSt9_Any_dataOS3_OlS8_
(raylet) [2021-06-18 14:27:22,866 E 212 228] logging.cc:441: @ 0x55bfbbbbb672 ray::ClientConnection::ProcessMessage()
(raylet) [2021-06-18 14:27:22,866 E 212 228] logging.cc:441: @ 0x55bfbbbb81a8 boost::asio::detail::read_op<>::operator()()
(raylet) [2021-06-18 14:27:22,867 E 212 228] logging.cc:441: @ 0x55bfbbbb855b boost::asio::detail::executor_function<>::do_complete()
(raylet) [2021-06-18 14:27:22,868 E 212 228] logging.cc:441: @ 0x55bfbb78d5a0 boost::asio::io_context::executor_type::dispatch<>()
(raylet) [2021-06-18 14:27:22,869 E 212 228] logging.cc:441: @ 0x55bfbbbb8f53 boost::asio::executor::dispatch<>()
(raylet) [2021-06-18 14:27:22,870 E 212 228] logging.cc:441: @ 0x55bfbbbb9148 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) [2021-06-18 14:27:22,871 E 212 228] logging.cc:441: @ 0x55bfbbf6e011 boost::asio::detail::scheduler::do_run_one()
(raylet) [2021-06-18 14:27:22,873 E 212 228] logging.cc:441: @ 0x55bfbbf6e141 boost::asio::detail::scheduler::run()
(raylet) [2021-06-18 14:27:22,873 E 212 228] logging.cc:441: @ 0x55bfbbf70340 boost::asio::io_context::run()
(raylet) [2021-06-18 14:27:22,874 E 212 228] logging.cc:441: @ 0x55bfbb8e13fd plasma::PlasmaStoreRunner::Start()
(raylet) [2021-06-18 14:27:22,874 E 212 228] logging.cc:441: @ 0x55bfbb88e697 std::thread::_State_impl<>::_M_run()
(raylet) [2021-06-18 14:27:22,875 E 212 228] logging.cc:441: @ 0x55bfbbfab020 execute_native_thread_routine
(raylet) [2021-06-18 14:27:22,875 E 212 228] logging.cc:441: @ 0x7fbabf197609 start_thread
(raylet) [2021-06-18 14:27:22,875 E 212 228] logging.cc:441: @ 0x7fbabed68293 clone
(autoscaler +1m28s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m36s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m44s) Adding 1 nodes of type local.cluster.node.
(autoscaler +1m52s) Adding 1 nodes of type local.cluster.node.
2021-06-18 14:27:52,116 WARNING worker.py:1114 -- The node with node id: 58da865a3ddbee6443f5f68c504a1a0a39a14df87d1a149ba7391497 and ip: 10.0.1.34 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(pid=142, ip=10.0.1.28) 2021-06-18 14:27:52,205 ERROR worker.py:409 -- SystemExit was raised from the worker
(pid=142, ip=10.0.1.28) Traceback (most recent call last):
(pid=142, ip=10.0.1.28) File "python/ray/_raylet.pyx", line 595, in ray._raylet.task_execution_handler
(pid=142, ip=10.0.1.28) File "python/ray/_raylet.pyx", line 453, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28) File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28) File "python/ray/_raylet.pyx", line 497, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28) File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
(pid=142, ip=10.0.1.28) return function(*args, **kwargs)
(pid=142, ip=10.0.1.28) File "/ceph/home/djakubiec/elcano/submissions/dan6/djlearn.py", line 411, in processGroupTrain
(pid=142, ip=10.0.1.28) model.pipeline.fit(groupResult['modelData'].train.X, groupResult['modelData'].train.yTrue)
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py", line 346, in fit
(pid=142, ip=10.0.1.28) self._final_estimator.fit(Xt, y, **fit_params_last_step)
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 393, in fit
(pid=142, ip=10.0.1.28) for i, t in enumerate(trees))
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1054, in __call__
(pid=142, ip=10.0.1.28) self.retrieve()
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 933, in retrieve
(pid=142, ip=10.0.1.28) self._output.extend(job.get(timeout=self.timeout))
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
(pid=142, ip=10.0.1.28) self.wait(timeout)
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
(pid=142, ip=10.0.1.28) self._event.wait(timeout)
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/threading.py", line 552, in wait
(pid=142, ip=10.0.1.28) signaled = self._cond.wait(timeout)
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/threading.py", line 296, in wait
(pid=142, ip=10.0.1.28) waiter.acquire()
(pid=142, ip=10.0.1.28) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 406, in sigterm_handler
(pid=142, ip=10.0.1.28) sys.exit(1)
(pid=142, ip=10.0.1.28) SystemExit: 1
2021-06-18 14:27:52,270 WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: e11fe2800445c79affffffffffffffffffffffff01000000 Worker ID: 34df4752ff905287f07553d1f4d3ec1cbd1dfe5fba57f51efd665533 Node ID: 8d7cff80b5bc5b838a93c495e569845209e6818740a66977172e223d Worker IP address: 10.0.1.28 Worker port: 10002 Worker PID: 142
(autoscaler +2m0s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m8s) Adding 1 nodes of type local.cluster.node.
(autoscaler +2m16s) Adding 1 nodes of type local.cluster.node.