How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Ray version: 2.2.0
I am using TorchTrainer on 2 nodes, each with 4 GPUs.
During training, my cluster crashes with the following error in the log:
(raylet) [2023-03-06 13:22:39,724 C 1467 1550] (raylet) store.cc:281: Check failed: entry != nullptr
(raylet) *** StackTrace Information ***
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fa17a) [0x55b7a0c5f17a] ray::operator<<()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fbc52) [0x55b7a0c60c52] ray::SpdLogMessage::Flush()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fbf67) [0x55b7a0c60f67] ray::RayLog::~RayLog()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x31f6cd) [0x55b7a0a846cd] plasma::PlasmaStore::ReleaseObject()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x323a09) [0x55b7a0a88a09] plasma::PlasmaStore::ProcessMessage()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x31aef5) [0x55b7a0a7fef5] std::_Function_handler<>::_M_invoke()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x342163) [0x55b7a0aa7163] std::_Function_handler<>::_M_invoke()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x49c3d1) [0x55b7a0c013d1] ray::ClientConnection::ProcessMessage()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4ddf96) [0x55b7a0c42f96] EventTracker::RecordExecution()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x496067) [0x55b7a0bfb067] boost::asio::detail::binder2<>::operator()()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x496738) [0x55b7a0bfb738] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa4a87b) [0x55b7a11af87b] boost::asio::detail::scheduler::do_run_one()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa4c041) [0x55b7a11b1041] boost::asio::detail::scheduler::run()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa4c270) [0x55b7a11b1270] boost::asio::io_context::run()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x31a118) [0x55b7a0a7f118] plasma::PlasmaStoreRunner::Start()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x2b6395) [0x55b7a0a1b395] std::thread::_State_impl<>::_M_run()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa9b050) [0x55b7a1200050] execute_native_thread_routine
(raylet) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f37fd2d9609] start_thread
(raylet) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f37fcea6133] __clone
(raylet)
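In case it helps with triage, here is a minimal sketch of a script that pulls the failed check and the stack frames out of a raylet log like the one above. The names `summarize_raylet_crash` and `first_non_logging_frame` are hypothetical helpers, not part of Ray, and the regexes only assume the line format shown in this post.

```python
import re

# Fatal check line, e.g. "... store.cc:281: Check failed: entry != nullptr"
CHECK_RE = re.compile(r"(?P<file>[\w./]+):(?P<line>\d+): Check failed: (?P<expr>.+)$")
# Stack frame, e.g. ".../raylet(+0x31f6cd) [0x55b7a0a846cd] plasma::PlasmaStore::ReleaseObject()"
FRAME_RE = re.compile(r"\([^)]*\+0x[0-9a-f]+\)\s+\[0x[0-9a-f]+\]\s+(?P<symbol>\S.*)$")

# Frames that belong to the logging path that prints the trace (assumption
# based on the symbol names at the top of the trace in this post).
LOGGING_PREFIXES = ("ray::operator<<", "ray::SpdLogMessage", "ray::RayLog")


def summarize_raylet_crash(log_text):
    """Return ((file, line, expression) or None, [frame symbols in order])."""
    failed_check = None
    frames = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = CHECK_RE.search(line)
        if match and failed_check is None:
            # Keep only the first failed check found in the log.
            failed_check = (match["file"], int(match["line"]), match["expr"])
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append(match["symbol"])
    return failed_check, frames


def first_non_logging_frame(frames):
    """Return the first frame that is not part of the logging path, or None."""
    for symbol in frames:
        if not symbol.startswith(LOGGING_PREFIXES):
            return symbol
    return None


# A few lines copied from the log in this post.
SAMPLE = """\
(raylet) [2023-03-06 13:22:39,724 C 1467 1550] (raylet) store.cc:281: Check failed: entry != nullptr
(raylet) *** StackTrace Information ***
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fa17a) [0x55b7a0c5f17a] ray::operator<<()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4fbf67) [0x55b7a0c60f67] ray::RayLog::~RayLog()
(raylet) /opt/conda/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x31f6cd) [0x55b7a0a846cd] plasma::PlasmaStore::ReleaseObject()
(raylet) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f37fcea6133] __clone
"""

if __name__ == "__main__":
    check, frames = summarize_raylet_crash(SAMPLE)
    print("failed check:", check)
    print("first non-logging frame:", first_non_logging_frame(frames))
```

Run on the trace above, this points at the check in store.cc line 281, reached from plasma::PlasmaStore::ReleaseObject(). It only locates where the check failed; it does not explain why the entry was missing.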
Any hints on how to track down the cause?
Thanks.