How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
This error is probably hard to reproduce, but it sometimes shows up when nodes disconnect and reconnect in rapid succession (I was stress-testing some fault-tolerance mechanisms):

```
ubuntu@ip-172-31-57-217:~/16-quokka/quokka$ python3 apps/tpc-h/tpch.py
(raylet, ip=172.31.62.47) *** SIGSEGV received at time=1681697165 on cpu 3 ***
(pid=19414, ip=172.31.62.47) PC: @ 0x7f060ca5fd20 (unknown) absl::lts_20211102::Mutex::Lock()
(pid=19414, ip=172.31.62.47) @ 0x7f060d880090 3504 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c4dfe1f 192 ray::gcs::NodeInfoAccessor::HandleNotification()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47dc0f 64 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4b68f5 176 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4de940 112 ray::rpc::GcsRpcClient::GetAllNodeInfo()::{lambda()#2}::operator()()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47f595 64 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(pid=19414, ip=172.31.62.47) @ 0x7f060c345ff5 32 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c6006a6 96 EventTracker::RecordExecution()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b95ee 48 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b9766 112 boost::asio::detail::completion_handler<>::do_complete()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca3157b 128 boost::asio::detail::scheduler::do_run_one()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca327b1 192 boost::asio::detail::scheduler::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca32a20 64 boost::asio::io_context::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060c3c387d 240 ray::core::CoreWorker::RunIOService()
(pid=19414, ip=172.31.62.47) @ 0x7f060cb5e6d0 (unknown) execute_native_thread_routine
(pid=19414, ip=172.31.62.47) @ 0x20d3850 182129536 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c325ba0 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) @ 0x9000838b51e90789 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) Fatal Python error: Segmentation fault
```
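I don't have a minimal repro. The sketch below is only a rough stand-in for the kind of node churn I was generating, using `ray.cluster_utils.Cluster` locally instead of my real multi-node EC2 setup; the task and all parameters are illustrative, not the Quokka TPC-H job from the trace above:

```python
# Hypothetical stand-in for my stress test: rapidly add and remove worker
# nodes on a local test cluster while trivial tasks run, so the driver's
# core worker keeps receiving node add/remove notifications
# (ray::gcs::NodeInfoAccessor::HandleNotification in the trace above).
import time

import ray
from ray.cluster_utils import Cluster  # test-only helper shipped with Ray

cluster = Cluster(initialize_head=True, connect=True)  # starts a head node and calls ray.init()

@ray.remote
def noop():
    return 1

for _ in range(50):
    # Bring up a worker node and give it some work so it registers with the GCS.
    node = cluster.add_node(num_cpus=1)
    ray.get([noop.remote() for _ in range(10)])
    time.sleep(0.5)
    # Tear it down again almost immediately to simulate a node dropping out.
    cluster.remove_node(node)

ray.shutdown()
cluster.shutdown()
```

In my actual runs the churn came from real EC2 nodes going away and rejoining, so this sketch may or may not exercise the same code path.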
Just an FYI in case this rings a bell for the right people.