Raylet crashes suddenly during training

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Ray 2.2.0
I am using TorchTrainer on 2 nodes, each with 4 GPUs.
During training, my cluster crashes with the following error in the log:

(raylet) [2023-03-06 13:22:39,724 C 1467 1550] (raylet) store.cc:281:  Check failed: entry != nullptr 
(raylet) *** StackTrace Information ***
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fa17a) [0x55b7a0c5f17a] ray::operator<<()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fbc52) [0x55b7a0c60c52] ray::SpdLogMessage::Flush()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fbf67) [0x55b7a0c60f67] ray::RayLog::~RayLog()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x31f6cd) [0x55b7a0a846cd] plasma::PlasmaStore::ReleaseObject()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x323a09) [0x55b7a0a88a09] plasma::PlasmaStore::ProcessMessage()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x31aef5) [0x55b7a0a7fef5] std::_Function_handler<>::_M_invoke()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x342163) [0x55b7a0aa7163] std::_Function_handler<>::_M_invoke()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49c3d1) [0x55b7a0c013d1] ray::ClientConnection::ProcessMessage()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4ddf96) [0x55b7a0c42f96] EventTracker::RecordExecution()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x496067) [0x55b7a0bfb067] boost::asio::detail::binder2<>::operator()()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x496738) [0x55b7a0bfb738] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa4a87b) [0x55b7a11af87b] boost::asio::detail::scheduler::do_run_one()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa4c041) [0x55b7a11b1041] boost::asio::detail::scheduler::run()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa4c270) [0x55b7a11b1270] boost::asio::io_context::run()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x31a118) [0x55b7a0a7f118] plasma::PlasmaStoreRunner::Start()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x2b6395) [0x55b7a0a1b395] std::thread::_State_impl<>::_M_run()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa9b050) [0x55b7a1200050] execute_native_thread_routine
(raylet) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f37fd2d9609] start_thread
(raylet) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f37fcea6133] __clone
(raylet) 

Any hints on how to find the cause?

Thanks.

Hi @igolant,

A check failure definitely means that there is a bug in our codebase.
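
When filing a bug for a failure like this, it helps to quote the exact check that fired. The following is a minimal sketch, not an official Ray tool, that pulls the source file, line number, and message out of a log line shaped like the one in the report. The regular expression is inferred only from that single line, and reading the two numbers after the severity letter as process and thread IDs is an assumption; `find_check_failures` is a hypothetical helper name.

```python
import re

# Layout inferred from the log line in the report:
# [timestamp severity pid tid] (component) file:line: message
# Treating the two integers as pid and tid is an assumption.
LOG_LINE = re.compile(
    r"\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<severity>[A-Z]) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"\((?P<component>[^)]+)\) "
    r"(?P<file>[\w./]+):(?P<line>\d+):\s+(?P<message>.*)"
)


def find_check_failures(log_text):
    """Return one dict per log line whose message contains 'Check failed'."""
    failures = []
    for raw in log_text.splitlines():
        match = LOG_LINE.search(raw)
        if match and "Check failed" in match.group("message"):
            failures.append(match.groupdict())
    return failures


# The failing line from the report, copied verbatim.
sample = (
    "(raylet) [2023-03-06 13:22:39,724 C 1467 1550] (raylet) "
    "store.cc:281:  Check failed: entry != nullptr "
)

for failure in find_check_failures(sample):
    print(failure["file"], failure["line"], failure["message"].strip())
```

Stack-trace lines do not match the pattern, so only the line that names the failed check is reported.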

Is it possible for you to share a simple reproduction script so that I can debug the issue?

Also, can you try the latest release, 2.3.0, and see if you still have the issue?
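
After upgrading, it is worth confirming that every node actually picked up the new version, since a stale environment on one node would make the comparison misleading. Here is a small sketch of such a check. `release_tuple` and `is_at_least` are hypothetical helper names, and the comparison looks only at the first three numeric components of the version string, so pre-release suffixes are not ordered properly.

```python
import re


def release_tuple(version_string):
    """Return the first three numeric components, e.g. '2.3.0' -> (2, 3, 0)."""
    parts = re.findall(r"\d+", version_string)
    return tuple(int(part) for part in parts[:3])


def is_at_least(installed, minimum="2.3.0"):
    """True if `installed` is the same as or newer than `minimum`."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    import ray  # only needed to read the installed version
    installed_version = ray.__version__
except ImportError:
    installed_version = None

if installed_version is None:
    print("ray is not installed in this environment")
else:
    print(f"ray {installed_version}, at least 2.3.0: {is_at_least(installed_version)}")
```

Running this on each node in the same environment the training job uses shows whether the upgrade took effect everywhere.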

Hi,
In 2.3.0 it seems to work fine.
Thanks!
