Hi,
First, I was unable to connect to my head node over SSH for some time. When I tried to check the cluster status with ray status during that period, I got the following error:
ray.exceptions.RpcError: failed to connect to all addresses
After the connection to the head node came back, I was able to get the status of the cluster, but the head node was unable to connect to all of its worker nodes.
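In case it helps, this is roughly how I check which worker nodes the head still considers alive (a minimal sketch using the Python API; the cluster address is a placeholder):

```python
import ray

# Connect to the existing cluster; "auto" assumes the script runs on a
# machine that is already part of the cluster (e.g. the head node).
ray.init(address="auto")

# ray.nodes() returns one dict per node known to the GCS;
# "Alive" is False for nodes the head has lost contact with.
for node in ray.nodes():
    state = "alive" if node["Alive"] else "DEAD"
    print(node["NodeID"], node["NodeManagerAddress"], state)
```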
Below is the information from the dashboard logs of one job:
Job supervisor actor could not be scheduled: The actor is not schedulable: The node specified via NodeAffinitySchedulingStrategy doesn't exist any more or is infeasible, and soft=False was specified.
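For context, this job was submitted through the Ray Jobs API (the actor name contains raysubmit_). The snippet below is only a simplified sketch of the submission path, not my exact code; the dashboard address and entrypoint are placeholders:

```python
from ray.job_submission import JobSubmissionClient

# Placeholder address; in my setup this points at the head node's dashboard port.
client = JobSubmissionClient("http://127.0.0.1:8265")

# The job supervisor actor mentioned in the error is created for this submission.
job_id = client.submit_job(entrypoint="python my_script.py")
print("submitted", job_id)
```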
FYI: I am using Ray version 2.6.0.
Can you help me fix this error?
I have also checked the gcs_server.out file, and here is the relevant information from it:
Failed to read the message from: f5619e77d999b4a1bfd01e79f49c1fe33daad8348483bcbf3910a131
Drop message received from 765a8b876d814c9412a597e92cd25d8e1bbec40bc293020ad3dde407 because the message version 0 is older than the local version 0. Message type: 0
gcs_worker_manager.cc:55: Reporting worker exit, worker id = 04000000ffffffffffffffffffffffffffffffffffffffffffffffff, node id = ffffffffffffffffffffffffffffffffffffffffffffffffffffffff, address = , exit_type = SYSTEM_ERROR, exit_detail = Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors… Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
gcs_actor_manager.cc:961: Worker 04000000ffffffffffffffffffffffffffffffffffffffffffffffff on node 5a51498ed849055785327f21262d1e128dc229a9ed2e468deb0bb017 exits, type=SYSTEM_ERROR, has creation_task_exception = 0
gcs_job_manager.cc:87: Finished marking job state, job id = 04000000
gcs_node_manager.cc:99: Draining node info, node id = 5a51498ed849055785327f21262d1e128dc229a9ed2e468deb0bb017
gcs_node_manager.cc:223: Removing node, node id = 5a51498ed849055785327f21262d1e128dc229a9ed2e468deb0bb017
gcs_placement_group_manager.cc:758: Node 5a51498ed849055785327f21262d1e128dc229a9ed2e468deb0bb017 failed, rescheduling the placement groups on the dead node.
gcs_actor_manager.cc:1038: Node 5a51498ed849055785327f21262d1e128dc229a9ed2e468deb0bb017 failed, reconstructing actors.
gcs_server.cc:380: [61] Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
gcs_node_manager.cc:145: Raylet 5a51498ed849055785327f21262d1e128dc229a9ed2e468deb0bb017 is drained. Status GrpcUnavailable: RPC Error message: Cancelling all calls; RPC Error details: . The information will be published to the cluster.
gcs_health_check_manager.cc:108: Health check failed for node e1c139bc7f7410b3dfe7509d7554530b5c168da3097e70ed2c41ec41, remaining checks 3, status 14, response status 0, status message failed to connect to all addresses, status details
Actor with name '_ray_internal_job_actor_raysubmit_c1N1gyTSNAhRkRTg' was not found.
[2023-10-03 16:41:10,226 I 128 128] (gcs_server) gcs_actor_manager.cc:253: Registering actor, job id = 0b000000, actor id = d1a5f4a60466c4c794362a580b000000
[2023-10-03 16:41:10,226 I 128 128] (gcs_server) gcs_actor_manager.cc:259: Registered actor, job id = 0b000000, actor id = d1a5f4a60466c4c794362a580b000000
[2023-10-03 16:41:10,227 I 128 128] (gcs_server) gcs_actor_manager.cc:278: Creating actor, job id = 0b000000, actor id = d1a5f4a60466c4c794362a580b000000
[2023-10-03 16:41:10,227 I 128 128] (gcs_server) gcs_actor_scheduler.cc:312: Start leasing worker from node db61453214609bbd94707f837e00c98ceaf01bfcf94f311a7d216494 for actor d1a5f4a60466c4c794362a580b000000, job id = 0b000000
[2023-10-03 16:41:10,230 I 128 128] (gcs_server) gcs_actor_scheduler.cc:433: The lease worker request from node db61453214609bbd94707f837e00c98ceaf01bfcf94f311a7d216494 for actor d1a5f4a60466c4c794362a580b000000(JobSupervisor.init) has been canceled, job id = 0b000000, cancel type: SCHEDULING_CANCELLED_UNSCHEDULABLE
[2023-10-03 16:41:10,230 I 128 128] (gcs_server) gcs_actor_manager.cc:807: Destroying actor, actor id = d1a5f4a60466c4c794362a580b000000, job id = 0b000000
[2023-10-03 16:41:10,230 I 128 128] (gcs_server) gcs_actor_manager.cc:730: Actor name _ray_internal_job_actor_raysubmit_c1N1gyTSNAhRkRTg is cleand up.
[2023-10-03 16:41:10,230 I 128 128] (gcs_server) gcs_actor_manager.cc:294: Finished creating actor, job id = 0b000000, actor id = d1a5f4a60466c4c794362a580b000000, status = SchedulingCancelled: Actor creation cancelled.
[2023-10-03 16:41:10,233 I 128 128] (gcs_server) gcs_actor_manager.cc:807: Destroying actor, actor id = d1a5f4a60466c4c794362a580b000000, job id = 0b000000
[2023-10-03 16:41:10,233 I 128 128] (gcs_server) gcs_actor_manager.cc:812: Tried to destroy actor that does not exist d1a5f4a60466c4c794362a580b000000
[2023-10-03 16:41:10,490 W 128 143] (gcs_server) gcs_task_manager.cc:240: Max number of tasks event (100000) allowed is reached. Old task events will be overwritten. Set RAY_task_events_max_num_task_in_gcs to a higher value to store more.
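Regarding that last warning: as far as I understand, RAY_task_events_max_num_task_in_gcs has to be set in the environment of the process that starts the GCS on the head node (i.e. before ray start --head, or before a local ray.init()). A rough sketch for the local case, with 200000 as an arbitrary example value:

```python
import os
import ray

# Example value only; the variable must be in the environment before the GCS starts.
# On a multi-node cluster it would instead be exported on the head node
# before running `ray start --head`.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "200000"

ray.init()  # starts a local GCS that picks up the setting
```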