Ray worker becomes unreachable

Hi Team,

What happens when a worker node is unreachable for some time? Assume the Ray cluster was started using the autoscaler.

Note: Ray master is alive and managing tasks with other alive workers.

@BalajiSelvaraj10 What does the cluster status show in the Ray Dashboard?

Are there any messages in the log files?

Hi @Jules_Damji, thanks for your response.

We haven’t faced this scenario yet. I asked this to understand how the Ray ecosystem behaves in this case.

@BalajiSelvaraj10 Sorry, I misunderstood.

Yes, the Raylet on each node in the cluster sends heartbeats to the GCS, and when no heartbeat has been received from the non-responsive node for some time, the GCS will mark that node as dead. The autoscaler will then launch a new node.
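To make the heartbeat idea concrete, here is a minimal toy sketch of timeout-based failure detection. It is not Ray's implementation: the `HeartbeatMonitor` class, the 30-second `timeout_s` value, and the node names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy stand-in for a component that tracks node heartbeats (not Ray's GCS)."""

    timeout_s: float  # hypothetical threshold, not a Ray default
    last_seen: dict = field(default_factory=dict)
    dead: set = field(default_factory=set)

    def record_heartbeat(self, node_id: str, now: float) -> None:
        # Assumption of this sketch: a node marked dead stays dead, and a
        # returning machine would register under a fresh id.
        if node_id not in self.dead:
            self.last_seen[node_id] = now

    def sweep(self, now: float) -> list:
        # Collect every node whose last heartbeat is older than the timeout.
        newly_dead = [
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        ]
        for node_id in newly_dead:
            del self.last_seen[node_id]
            self.dead.add(node_id)
        return sorted(newly_dead)


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.record_heartbeat("worker-a", now=0.0)
monitor.record_heartbeat("worker-b", now=0.0)
monitor.record_heartbeat("worker-a", now=25.0)  # worker-b has gone silent
print(monitor.sweep(now=40.0))  # nodes newly marked dead at t=40
```

In this toy run only the silent node crosses the timeout, so it is the only one reported by `sweep`; in a real cluster the autoscaler would then be responsible for requesting a replacement.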

Here is some documentation about Fault tolerance.

Hi @Jules_Damji,

The Ray master will add a new worker node whenever pending processes pile up.

"Worker nodes send heartbeats to the GCS periodically, and when there is no heartbeat for some time, cluster management declares that worker dead.

Once the same worker node comes back alive, that worker will be in the queue, and whenever more processes are required the master will add it as a new worker node."

The above is my understanding; kindly correct me if I’m wrong.

@BalajiSelvaraj10 There is no notion of “Ray master” in a Ray cluster. We don’t use master/slave terms because they do not apply here, nor are they used elsewhere.

The autoscaler will add a new node to the cluster, and any pending or new tasks to be scheduled will be scheduled on the new node if necessary.

If the dead node comes back alive, it’ll join the cluster and send heartbeats to the GCS, and the GCS will inform all other running Raylets that there’s a new node available.
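One way to observe this from a driver is to look at the node records returned by `ray.nodes()`. The sketch below is only illustrative: `summarize_nodes` and `live_summary` are hypothetical helpers, the sample records and IP addresses are made up, and only the `Alive` and `NodeManagerAddress` fields of each record are used.

```python
def summarize_nodes(nodes):
    """Split node records (shaped like ray.nodes() entries) into alive and dead."""
    alive = sorted(n["NodeManagerAddress"] for n in nodes if n["Alive"])
    dead = sorted(n["NodeManagerAddress"] for n in nodes if not n["Alive"])
    return {"alive": alive, "dead": dead}


def live_summary():
    """Hypothetical helper: query a running cluster (requires Ray installed)."""
    import ray

    ray.init(address="auto")  # connect to an already running cluster
    return summarize_nodes(ray.nodes())


# Made-up records so the sketch runs without a cluster.
sample = [
    {"NodeManagerAddress": "10.0.0.1", "Alive": True},
    {"NodeManagerAddress": "10.0.0.2", "Alive": False},
    {"NodeManagerAddress": "10.0.0.3", "Alive": True},
]
print(summarize_nodes(sample))
```

On a real cluster you would call `live_summary()` instead of passing sample data, and compare the result before and after the node goes away or comes back.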

You might want to read the Ray Architecture Paper to get a better idea, because fault tolerance spans tasks, actors, and shared objects, as well as the head node and GCS.

I hope that answers your questions. Let me know if anything is unclear.

HTH, cheers
Jules