Context:
How severe: High
Case: RayCluster + Ray Data + RayJob used to run a distributed inference task
Depends: Python 3.10.13, Ray 2.34.0
Problem description: The Ray head container occasionally exits and restarts when jobs are submitted to the RayCluster via curl. When this happens, the job submission fails and all currently running jobs fail as well.
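For reference, the submission path in question is the Jobs API exposed by the dashboard on the head node; our curl call is equivalent to something like the following Python sketch (the address and entrypoint are placeholders, not the real values from our environment):

```python
# Sketch of the job submission that triggers the issue.
# The dashboard address and entrypoint are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # head node dashboard address

# Submit a job the same way our curl call does (via the dashboard's Jobs API).
job_id = client.submit_job(
    entrypoint="python inference_job.py",   # placeholder entrypoint
    runtime_env={"working_dir": "./"},      # optional, matches a typical RayJob setup
)
print("submitted:", job_id)
print(client.get_job_status(job_id))
```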
Any logs: Essentially none. There are no errors or exceptions in either the .out or the .err files, except for the following in raylet.out:
The node with node id: 61d503aa6ca8f1753c9dd8c9d93fcb5ff915850197604ec5f7296526 and address: 172.22.3.35 and node name: 172.22.3.35 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.), or (2) raylet has lagging heartbeats due to slow network or busy workload.
But I suspect this is just a red herring.
Preliminary Investigation: My guess is that the root cause is a crash of the GCS server, but there is no failure message in gcs_server.out. The log simply shows:
[2024-08-30 05:03:56,409 I 27 27] (gcs_server) gcs_actor_manager.cc:1340: Actor created successfully job_id=0e000000 actor_id=42b3aab25a34ba665c6d303f0e000000
[2024-08-30 05:03:56,410 I 27 27] (gcs_server) gcs_actor_manager.cc:357: Finished creating actor. Status: OK job_id=0e000000 actor_id=42b3aab25a34ba665c6d303f0e000000
and then ends. We therefore cannot confirm the GCS crash, and we have no idea why it would crash. We have essentially ruled out OOM, since the cluster metrics show plenty of available memory at the time of the crash.
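Since gcs_server.out ends without any error, one way to narrow down the crash window is to poll the head node's dashboard from outside the cluster and record exactly when it stops responding, then correlate that timestamp with the container restart. A rough watchdog sketch (the dashboard address, poll interval, and use of the Jobs API list endpoint are assumptions for illustration):

```python
# Rough watchdog: record when the head node's dashboard stops responding.
# Address, interval, and endpoint are illustrative assumptions.
import datetime
import time

import requests

DASHBOARD = "http://127.0.0.1:8265"  # replace with the head node's dashboard address
INTERVAL_S = 5

was_up = True
while True:
    try:
        # Any HTTP response at all means the dashboard (and hence the head container) is reachable.
        requests.get(f"{DASHBOARD}/api/jobs/", timeout=5)
        up = True
    except requests.RequestException:
        up = False
    if up != was_up:
        state = "UP" if up else "DOWN"
        print(f"{datetime.datetime.now().isoformat()} dashboard {state}")
        was_up = up
    time.sleep(INTERVAL_S)
```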
Extra: 1. For fault tolerance (FT) we use Redis as the backend for GCS. We suspect the connection to Redis might be the cause, but we cannot verify it since there is no error log (see the probe sketch below). 2. From our observation, the crash happens quite reliably when the cluster has been idle for a while (roughly half an hour) and then suddenly receives a job submission.
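Regarding point 1, one way to test the idle-connection theory without any GCS error logs is to hold a separate connection open against the same Redis instance, stay idle for a similar period, and see whether that connection is dropped. A minimal sketch with redis-py (host, port, and idle duration are assumptions):

```python
# Minimal probe for the suspected Redis idle-timeout: open one connection,
# stay idle for roughly the same period after which the head container crashes,
# then reuse the same connection. Host, port, and idle time are assumptions.
import time

import redis

r = redis.Redis(host="127.0.0.1", port=6379)  # point at the GCS FT Redis backend

r.ping()                # establish the connection while the cluster is active
print("initial ping OK")

time.sleep(30 * 60)     # idle for ~30 minutes, as observed before the crashes

try:
    r.ping()            # reuse the idle connection, as GCS would on a sudden job submission
    print("ping after idle OK: connection survived")
except redis.ConnectionError as e:
    print("connection dropped during idle:", e)
```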
Can you supply instructions on how to reproduce this?
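Based on the observation in point 2 above (the crash tends to follow a job submission after roughly half an hour of idleness), a reproduction attempt could look like the sketch below. The cluster address, idle duration, and entrypoints are placeholders, not a confirmed recipe:

```python
# Tentative reproduction sketch based on the idle-then-submit pattern described above.
# Address, idle duration, and entrypoints are placeholders, not a confirmed recipe.
import time

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # head node dashboard address

# 1. Warm the cluster up with one job so it is not freshly started.
client.submit_job(entrypoint="python -c 'print(\"warmup\")'")

# 2. Leave the cluster completely idle for about half an hour.
time.sleep(30 * 60)

# 3. Submit another job and watch whether the head container exits and restarts.
job_id = client.submit_job(entrypoint="python -c 'print(\"after idle\")'")
print("submitted after idle:", job_id, client.get_job_status(job_id))
```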