Very rare error that occurs when nodes disconnect and then reconnect

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

This error is probably pretty hard to reproduce, but when nodes disconnect and reconnect in rapid succession (I was stress testing some fault-tolerance mechanisms), you will sometimes encounter it:

ubuntu@ip-172-31-57-217:~/16-quokka/quokka$ python3 apps/tpc-h/tpch.py
(raylet, ip=172.31.62.47) *** SIGSEGV received at time=1681697165 on cpu 3 ***
(pid=19414, ip=172.31.62.47) PC: @ 0x7f060ca5fd20 (unknown) absl::lts_20211102::Mutex::Lock()
(pid=19414, ip=172.31.62.47) @ 0x7f060d880090 3504 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c4dfe1f 192 ray::gcs::NodeInfoAccessor::HandleNotification()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47dc0f 64 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4b68f5 176 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4de940 112 ray::rpc::GcsRpcClient::GetAllNodeInfo()::{lambda()#2}::operator()()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47f595 64 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(pid=19414, ip=172.31.62.47) @ 0x7f060c345ff5 32 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c6006a6 96 EventTracker::RecordExecution()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b95ee 48 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b9766 112 boost::asio::detail::completion_handler<>::do_complete()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca3157b 128 boost::asio::detail::scheduler::do_run_one()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca327b1 192 boost::asio::detail::scheduler::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca32a20 64 boost::asio::io_context::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060c3c387d 240 ray::core::CoreWorker::RunIOService()
(pid=19414, ip=172.31.62.47) @ 0x7f060cb5e6d0 (unknown) execute_native_thread_routine
(pid=19414, ip=172.31.62.47) @ 0x20d3850 182129536 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c325ba0 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) @ 0x9000838b51e90789 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) Fatal Python error: Segmentation fault

Just FYI; it might ring a bell for the right people.

Thanks @marsupialtail!

This actually looks related to some other issues we have been running into. (And it has a much clearer stack trace than the other ones we got.)

How do you do “disconnect and reconnect in rapid succession”? I would love to do something similar and see if I can repro it.

Have a Ray cluster and, while a job is running, go onto one of the worker nodes and run ray stop. Then, after the job completes, reconnect that worker node to the cluster and relaunch the job. A rough sketch of these steps follows below.
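For anyone trying to reproduce it, here is a minimal sketch of that loop in Python, run on the worker node. The head-node address, wait times, and cycle count are placeholders, and the job itself (e.g. python3 apps/tpc-h/tpch.py) is assumed to be relaunched from the driver between cycles:

# Hypothetical repro sketch (run on the worker node).
import subprocess
import time

HEAD_ADDRESS = "172.31.57.217:6379"  # placeholder: your head node's GCS address

for cycle in range(10):
    # 1. While the job is running, disconnect this worker from the cluster.
    subprocess.run(["ray", "stop"], check=True)

    # 2. Give the job on the driver time to finish (or fail over).
    time.sleep(60)  # placeholder; in practice, watch the driver/job status

    # 3. Reconnect the worker to the same head node.
    subprocess.run(["ray", "start", f"--address={HEAD_ADDRESS}"], check=True)

    # 4. Relaunch the job from the driver (not shown here), then repeat.
    time.sleep(10)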

Thank you! Will look into it. Tracking issue here: [core] Segfault happens when continuously discconnect and reconnect ray node · Issue #34637 · ray-project/ray · GitHub

Is this on nightly, or Ray 2.3?