Ray worker becomes unreachable

BalajiSelvaraj10 · May 15, 2023, 10:24am

Hi Team,

What happens when a worker node is unreachable for some time? Assume the Ray cluster was started using the autoscaler.

Note: The Ray master is alive and managing tasks with the other alive workers.

Jules_Damji · May 15, 2023, 6:45pm

@BalajiSelvaraj10 What does the cluster status show in the Ray Dashboard?

Are there any messages in the log files?

BalajiSelvaraj10 · May 16, 2023, 6:16pm

Hi @Jules_Damji, thanks for your response.

We haven’t faced this scenario yet. I asked in order to understand how the Ray ecosystem behaves in this case.

Jules_Damji · May 16, 2023, 7:40pm

@BalajiSelvaraj10 Sorry, I misunderstood.

Yes, the Raylet on each node in the cluster sends heartbeats to the GCS. When no heartbeat has been received from a non-responsive node for some time, the GCS will mark that node as dead, and the autoscaler will launch a new node.
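To make the heartbeat idea concrete, here is a toy sketch of timeout-based liveness tracking. It is not Ray's implementation: the `HeartbeatMonitor` class, the node names, and the 30-second timeout are all hypothetical and only illustrate the "no heartbeat for some time means dead" rule described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class HeartbeatMonitor:
    """Toy stand-in for heartbeat-based liveness tracking (not Ray's code)."""

    timeout_s: float  # hypothetical threshold, not a real Ray setting
    last_seen: Dict[str, float] = field(default_factory=dict)
    dead: Set[str] = field(default_factory=set)

    def heartbeat(self, node_id: str, now: float) -> None:
        # Record the latest heartbeat; a node that reports again is alive.
        self.last_seen[node_id] = now
        self.dead.discard(node_id)

    def sweep(self, now: float) -> List[str]:
        # Mark every node whose last heartbeat is older than the timeout.
        newly_dead = sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if node_id not in self.dead and now - seen > self.timeout_s
        )
        self.dead.update(newly_dead)
        return newly_dead


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.heartbeat("worker-1", now=0.0)
monitor.heartbeat("worker-2", now=0.0)
monitor.heartbeat("worker-1", now=25.0)  # worker-2 has gone silent
print(monitor.sweep(now=40.0))           # worker-2 exceeded the 30 s timeout
```

In this sketch, a component playing the autoscaler's role would react to the list returned by `sweep` by requesting a replacement node.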

Here is some documentation about Fault tolerance.

BalajiSelvaraj10 · May 17, 2023, 5:35am

Hi @Jules_Damji ,

The Ray master will add a new worker node whenever pending processes pile up.

"Worker nodes send heartbeats to the GCS periodically, and when there is no heartbeat for some time, cluster management declares that worker dead.

Once the same worker node comes back alive, it will wait in a queue, and whenever more processes are required the master will add it as a new worker node."

The above is my understanding; kindly correct me if I’m wrong.

Jules_Damji · May 17, 2023, 3:52pm

@BalajiSelvaraj10 There is no notion of “Ray master” in a Ray cluster. And we don’t use master/slave terms because they do not apply here, nor are they used elsewhere.

The autoscaler will add a new node to the cluster, and any pending or new tasks to be scheduled will be scheduled on the new node if necessary.

If the dead node comes back alive, it’ll join the cluster, send heartbeats to the GCS, and the GCS will inform all other running Raylets that there’s a new node available.
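One way to watch this from a driver is `ray.nodes()`, which returns one dictionary per node known to the cluster, including `NodeID` and `Alive` fields. The sketch below is only an illustration: the helper names `split_by_liveness` and `report_live_cluster` are made up, and the sample data is fabricated so the example runs without a cluster.

```python
from typing import List, Tuple


def split_by_liveness(nodes: List[dict]) -> Tuple[List[str], List[str]]:
    """Split node records (shaped like ray.nodes() entries) into alive and dead IDs."""
    alive = [n["NodeID"] for n in nodes if n.get("Alive")]
    dead = [n["NodeID"] for n in nodes if not n.get("Alive")]
    return alive, dead


def report_live_cluster() -> Tuple[List[str], List[str]]:
    """Query a running cluster; call this only where Ray is installed and started."""
    import ray  # imported lazily so the sample below runs without Ray

    ray.init(address="auto", ignore_reinit_error=True)
    return split_by_liveness(ray.nodes())


# Made-up records standing in for ray.nodes() output.
sample = [
    {"NodeID": "aaa", "Alive": True},
    {"NodeID": "bbb", "Alive": False},
    {"NodeID": "ccc", "Alive": True},
]
alive_ids, dead_ids = split_by_liveness(sample)
print("alive:", alive_ids)
print("dead:", dead_ids)
```

On a real cluster, calling `report_live_cluster()` before and after a node drops out should show that node's ID moving from the first list to the second.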

You might want to read the Ray Architecture Paper to get a better idea, because fault tolerance is across tasks, actors, and shared objects, along with head node and GCS.

I hope that answers your questions. Let me know if you are unclear on anything.

hth__cheers
Jules
