Very rare error that occurs when nodes disconnect and then reconnect

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

This error is probably pretty hard to reproduce, but when nodes disconnect and reconnect in rapid succession (I was stress testing some fault-tolerance mechanisms), you will sometimes encounter it:

ubuntu@ip-172-31-57-217:~/16-quokka/quokka$ python3 apps/tpc-h/tpch.py
(raylet, ip=172.31.62.47) *** SIGSEGV received at time=1681697165 on cpu 3 ***
(pid=19414, ip=172.31.62.47) PC: @ 0x7f060ca5fd20 (unknown) absl::lts_20211102::Mutex::Lock()
(pid=19414, ip=172.31.62.47) @ 0x7f060d880090 3504 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c4dfe1f 192 ray::gcs::NodeInfoAccessor::HandleNotification()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47dc0f 64 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4b68f5 176 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4de940 112 ray::rpc::GcsRpcClient::GetAllNodeInfo()::{lambda()#2}::operator()()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47f595 64 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(pid=19414, ip=172.31.62.47) @ 0x7f060c345ff5 32 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c6006a6 96 EventTracker::RecordExecution()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b95ee 48 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b9766 112 boost::asio::detail::completion_handler<>::do_complete()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca3157b 128 boost::asio::detail::scheduler::do_run_one()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca327b1 192 boost::asio::detail::scheduler::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca32a20 64 boost::asio::io_context::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060c3c387d 240 ray::core::CoreWorker::RunIOService()
(pid=19414, ip=172.31.62.47) @ 0x7f060cb5e6d0 (unknown) execute_native_thread_routine
(pid=19414, ip=172.31.62.47) @ 0x20d3850 182129536 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c325ba0 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) @ 0x9000838b51e90789 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) Fatal Python error: Segmentation fault

Just FYI; it might ring a bell for the right people.

Thanks @marsupialtail!

This actually looks related to some other issues we have been running into. (And it has a much clearer stack trace than the other ones we got.)

How do you do “disconnect and reconnect in rapid succession”? I would love to do something similar and see if I can repro it.

Have a Ray cluster and, while a job is running, go onto one of the worker nodes and run ray stop. Then, after the job completes, reconnect that worker node to the cluster and relaunch the job. A rough sketch of these steps follows below.
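For anyone trying to reproduce it, here is a minimal sketch of that loop in Python, run on the worker node. The head-node address, wait times, and cycle count are placeholders, and the job itself (e.g. python3 apps/tpc-h/tpch.py) is assumed to be relaunched from the driver between cycles:

# Hypothetical repro sketch (run on the worker node).
import subprocess
import time

HEAD_ADDRESS = "172.31.57.217:6379"  # placeholder: your head node's GCS address

for cycle in range(10):
    # 1. While the job is running, disconnect this worker from the cluster.
    subprocess.run(["ray", "stop"], check=True)

    # 2. Give the job on the driver time to finish (or fail over).
    time.sleep(60)  # placeholder; in practice, watch the driver/job status

    # 3. Reconnect the worker to the same head node.
    subprocess.run(["ray", "start", f"--address={HEAD_ADDRESS}"], check=True)

    # 4. Relaunch the job from the driver (not shown here), then repeat.
    time.sleep(10)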

Thank you! Will look into it. Tracking issue here: [core] Segfault happens when continuously discconnect and reconnect ray node · Issue #34637 · ray-project/ray · GitHub

Is this on nightly, or Ray 2.3?