Ray worker behaviour

What happens when a worker node collapses? The documents mention that tasks on that worker will be taken over by other worker nodes based on the max retry configuration. I want to know how the worker comes back up. Why are these details not mentioned in any of the documents?

This cluster was started using a YAML file that specifies head_ip and worker_ips.

Does this page on fault tolerance in Ray (both at the application and system level), which describes how tasks, actors, and objects are handled, not answer your questions?

Are you asking what’s the default retry count?

@BalajiSelvaraj10 We have to distinguish between worker processes and the nodes (Ray cluster nodes) that run worker processes. Are you asking about worker processes going down on a Ray cluster node, or about the node itself going down (hence killing all the worker processes on it)?

It’s easy to conflate the two: worker processes and worker nodes (individual Ray cluster nodes that run worker processes, which in turn run tasks and actors).

@sangcho He seems to be asking how the worker (here I assume he means individual worker processes running tasks and actors) comes back up. Who is responsible for detecting the failure and relaunching it on the node where the workers died?

Hmm, actually, reading it again, it sounds like he means the worker node. I will answer based on that assumption.

Usually, worker node restart is taken care of by the Ray autoscaler. Basically, this is what happens when your task fails (there is a short code sketch after the list below). If you use KubeRay, I believe it has the same functionality. If you use a manual deployment, you have to build your own mechanism to restart nodes.

  1. A task fails due to a worker node failure.
  2. The submitter of the task resubmits the task.
  3. Since the task requires resources (let’s say “num_cpus=1”), this creates a resource demand of {“CPU”: 1}; if there are nodes in the cluster with these resources available, the task is simply rescheduled there.
  4. If there are no free resources in the cluster, the autoscaler detects the demand and starts a new node.
  5. The task is scheduled on the new node.
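
To make the steps above concrete, here is a minimal sketch (the function process_chunk and its values are made up for illustration; it assumes a running Ray cluster) of a task declared with num_cpus=1 and max_retries, which Ray resubmits if the worker executing it dies:

```python
import ray

ray.init()  # or ray.init(address="auto") when connecting to an existing cluster

# Hypothetical task used for illustration. If the worker process or node
# running it dies, Ray resubmits the task up to max_retries times
# (3 is also the current default for tasks). Each resubmission creates a
# {"CPU": 1} resource demand that either fits on an existing node or
# triggers the autoscaler to add a new one.
@ray.remote(num_cpus=1, max_retries=3)
def process_chunk(chunk_id):
    return chunk_id * 2

# The caller blocks until each task (or one of its retries) succeeds,
# or raises an error once retries are exhausted.
results = ray.get([process_chunk.remote(i) for i in range(4)])
print(results)
```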

Hi @sangcho & @Jules_Damji, thanks for your replies.

I have a cluster setup (1 head and 2 worker nodes) that was started using the Ray autoscaler with a YAML file (command: ray up default.yaml).

Here, what will happen if one of the worker nodes collapses because of high resource usage (for example, OOM)?

Also, adding to this question: does Ray have any fault tolerance mechanism for the head node, such as one of the worker nodes becoming the head if the head collapses? I saw a mechanism that uses Redis to restore state after the head restarts, but I need the one I described.

Hi @sangcho & @Jules_Damji, could I get any update on the above question?

@BalajiSelvaraj10

Here, what will happen if one of the worker nodes collapses because of high resource usage (for example, OOM)?

Let’s distinguish between a Ray worker node and a Ray worker process running on a Ray worker node.
An OOM error may kill the worker process running your Ray task on a Ray worker node, not the node itself. The Ray 2.3 OOM monitor will kill the running Ray task, which will then be retried depending on the max_retries parameter (we explain in the blog how the retries happen).
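
To illustrate what that looks like from the application side, here is a hedged sketch (not from the blog; hog_memory is a made-up function, and it assumes Ray 2.2+ with the memory monitor enabled; it deliberately exhausts memory, so run it only on a disposable machine):

```python
import ray
from ray.exceptions import RayError

ray.init()

# Hypothetical memory-hungry task, for illustration only. If the memory
# monitor kills the worker process executing it, Ray retries the task up
# to max_retries times before surfacing the error to the caller.
@ray.remote(max_retries=2)
def hog_memory():
    buf = []
    while True:
        buf.append(bytearray(100 * 1024 * 1024))  # allocate ~100 MiB per iteration

try:
    ray.get(hog_memory.remote())
except RayError as e:
    # In recent Ray versions this is typically ray.exceptions.OutOfMemoryError.
    print("Task failed after its retries were exhausted:", e)
```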

Also, adding to this question: does Ray have any fault tolerance mechanism for the head node, such as one of the worker nodes becoming the head if the head collapses? I saw a mechanism that uses Redis to restore state after the head restarts, but I need the one I described.

If a particular Ray worker node dies, the autoscaler will relaunch it. For Ray applications, fault tolerance is handled as described here.

For the head node, we have documented how you can use this with Redis and KubeRay. Head node HA is only supported with KubeRay plus Redis.

In short, to get HA for the head node you will have to run on K8s with KubeRay, using Redis as a backup store to replicate all the GCS data.

cc: @jjyao
