Ray head crashed silently

Context:
How severe: High
Case: RayCluster + Ray Data + RayJob to create a distributed inference task
Depends: Python 3.10.13, Ray 2.34.0
Problem description: The Ray head container occasionally exits and restarts when jobs are submitted to the RayCluster using curl. This causes the job submission to fail and also brings down all running jobs. (A rough Python equivalent of the curl submission is sketched right after this report.)
Any logs: In general, none. There are no errors or exceptions in either the .out or the .err files, except for the following in raylet.out: "The node with node id: 61d503aa6ca8f1753c9dd8c9d93fcb5ff915850197604ec5f7296526 and address: 172.22.3.35 and node name: 172.22.3.35 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload." But I guess this is just a red herring.
Preliminary Investigation: I guess the root cause is a crash of the GCS server, but there is no failure message in gcs_server.out. The log basically just shows "[2024-08-30 05:03:56,409 I 27 27] (gcs_server) gcs_actor_manager.cc:1340: Actor created successfully job_id=0e000000 actor_id=42b3aab25a34ba665c6d303f0e000000" and "[2024-08-30 05:03:56,410 I 27 27] (gcs_server) gcs_actor_manager.cc:357: Finished creating actor. Status: OK job_id=0e000000 actor_id=42b3aab25a34ba665c6d303f0e000000" and then ends. Therefore we cannot confirm the GCS crash, and we have no idea why it would crash. We basically ruled out an OOM issue, since there is plenty of available memory according to the cluster metrics when the crash happens.
Extra: 1. For fault tolerance (FT), we use Redis as the backend for the GCS. We guessed the connection to Redis might be the cause, but we still cannot verify it because there is no error log. 2. From our observation, it happens quite reliably when the cluster has been idle for a while (probably half an hour) and then suddenly accepts a job submission.
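For reference, the submission that triggers the crash is an ordinary Ray Jobs API call. A rough Python equivalent of our curl command (the head address and entrypoint below are placeholders):

```python
# Rough Python equivalent of the curl call we use to submit jobs via the
# Ray Jobs REST API. Head address and entrypoint are placeholders.
import requests

HEAD = "http://127.0.0.1:8265"  # Ray dashboard / job server on the head node

resp = requests.post(
    f"{HEAD}/api/jobs/",
    json={"entrypoint": "python my_inference_task.py"},
)
resp.raise_for_status()
print(resp.json())  # contains the submission id on success
```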

Can you supply instructions on how to reproduce this?

More info: 1. In https://github.com/ray-project/ray/blob/master/python/ray/scripts/scripts.py:

            with cli_logger.indented():
                for process_type, process in unexpected_deceased:
                    cli_logger.error(
                        "{}",
                        cf.bold(str(process_type)),
                        _tags={"exit code": str(process.returncode)},
                    )

            cli_logger.newline()
            cli_logger.error("Remaining processes will be killed.")
            # explicitly kill all processes since atexit handlers
            # will not exit with errors.
            node.kill_all_processes(check_alive=False, allow_graceful=False)
            #### fix ####
            cli_logger.flush()
            #### fix ####
            os._exit(1)

After adding cli_logger.flush() before os._exit(1), I was able to catch the exit code and error message: "Some Ray subprocesses exited unexpectedly: gcs_server exit code=-13".
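A negative subprocess return code means the child process was killed by that signal number, so -13 corresponds to signal 13, i.e. SIGPIPE. A quick sanity check in Python:

```python
# Decode a negative subprocess return code into a signal name.
# The gcs_server exit code we caught (-13) maps to SIGPIPE.
import signal


def describe_returncode(rc: int) -> str:
    if rc < 0:
        return f"killed by signal {-rc} ({signal.Signals(-rc).name})"
    return f"exited with code {rc}"


print(describe_returncode(-13))  # -> killed by signal 13 (SIGPIPE)
```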
2. To replicate the error, we created a sandbox Ray cluster and used a script to submit jobs at random time intervals (from 5 minutes to 2 hours), because if we submit jobs at a fixed interval, the problem does not show up. (A sketch of such a submission loop is given after this list.)
3. We suspected the cause was the Redis server we use, which is provided by our cloud provider, so we ran a controlled experiment with a Ray cluster backed by an on-premise Redis server. It turned out that the on-premise Redis server did not cause any problems.
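Our actual script drives the submissions with curl against the Ray Jobs REST endpoint, but a minimal Python sketch of the same loop looks like this (head address and entrypoint are placeholders):

```python
# Minimal sketch of the sandbox reproduction loop: submit a job, then
# sleep a random interval between 5 minutes and 2 hours before the next
# submission. Head address and entrypoint are placeholders.
import random
import time

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

while True:
    submission_id = client.submit_job(entrypoint="python my_inference_task.py")
    print(f"submitted {submission_id}")
    time.sleep(random.uniform(5 * 60, 2 * 60 * 60))
```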

After adding a couple of lines of code to the relevant .py and .cc files and rebuilding Ray, we were able to catch the error code and stack trace as follows:

    src/ray/gcs/redis_client.cc:82 (PID: 28, TID: 28, errno: 32 (Broken pipe)): Check failed: reply
    *** StackTrace Information ***
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x7f0627) [0x55c83bac4627] ray::operator<<()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x7f4a70) [0x55c83bac8a70] ray::RayLog::~RayLog()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x57da38) [0x55c83b851a38] ray::gcs::RedisClient::GetNextJobID()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x46949d) [0x55c83b73d49d] ray::gcs::GcsJobManager::HandleGetNextJobID()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x34cb5c) [0x55c83b620b5c] std::_Function_handler<>::_M_invoke()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x6e653f) [0x55c83b9ba53f] EventTracker::RecordExecution()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x6e1a9f) [0x55c83b9b5a9f] std::_Function_handler<>::_M_invoke()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x6e1f4c) [0x55c83b9b5f4c] boost::asio::detail::completion_handler<>::do_complete()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x807384) [0x55c83badb384] boost::asio::detail::scheduler::do_run_one()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x808815) [0x55c83badc815] boost::asio::detail::scheduler::run()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x80a816) [0x55c83bade816] boost::asio::io_context::run()
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x208046) [0x55c83b4dc046] main
    /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7fd911a0424a]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fd911a04305] __libc_start_main
    /nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x267f2e) [0x55c83b53bf2e] _start

We are wondering whether this is related to this hiredis issue: redis/hiredis#910

Which Cloud Provider are you using?

Mostly nodes from Amazon Web Services.

We have finally found the cause: our Redis server has a 3600-second timeout configured, which means that if a connection is idle for over an hour the server closes it, and on the next write the client receives a SIGPIPE signal, which kills the process in most cases if it is not handled properly. After setting the timeout to 0, the crashes stopped.
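For anyone hitting the same problem, the relevant setting is Redis's `timeout` config (seconds of idle time before the server closes a client connection; 0 disables it). A quick way to inspect and, where permitted, change it with redis-py; the host and credentials below are placeholders, and managed cloud Redis services often block CONFIG SET, so the timeout may have to be changed through the provider's parameter settings instead:

```python
# Inspect and (where permitted) disable the Redis idle-connection timeout.
# Host, port, and password are placeholders; some managed Redis services
# block CONFIG SET, in which case the setting must be changed provider-side.
import redis

r = redis.Redis(host="my-redis-host", port=6379, password=None)

print(r.config_get("timeout"))  # e.g. {'timeout': '3600'}
r.config_set("timeout", 0)      # 0 = never close idle client connections
print(r.config_get("timeout"))  # {'timeout': '0'}
```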
