Newbie Question: Worker Fault Tolerance?

radiantone · February 24, 2022, 7:57am

Hi,
I just started to dig into ray.io and wanted to understand how worker fault tolerance behaves. If a worker is processing a remote call and suddenly disappears, what happens next?

thanks!

mmuru · February 24, 2022, 3:00pm

@radiantone: Ray will rerun the task until either the task succeeds or the maximum number of retries is exceeded. The default number of retries is 3. For details, see the Ray docs.

radiantone · February 24, 2022, 3:18pm

If that worker never comes back, does ray transparently find another worker to run the task? Or does it fail? The docs weren’t really clear on that. It says it will try to restart the actor, but what I’m looking for is to retry the task somewhere else. Perhaps a worker on another node.

mmuru · February 25, 2022, 3:32pm

You need to read the Ray docs further to get context on tasks and actors and to understand how Ray handles and runs them. I suggest running the examples to get a better understanding.

A Ray task can run on any available worker that meets its resource requirements. If the worker running it dies, Ray retries the task on another such worker, and the task fails once max_retries (default 3) is exhausted.
A Ray actor and its tasks run only on a dedicated worker. On failure, Ray will try to restart the actor's worker on any available node that meets the resource requirements, subject to the placement group strategy, and it will rerun the actor's tasks with either at-most-once or at-least-once semantics.
@yic and @sangcho: Please chime in if I missed anything.

sangcho · February 28, 2022, 3:47pm

Thanks for your response @mmuru! One more thing: actor fault tolerance is not enabled by default (but task retries, up to 3, are enabled by default). Please check Fault Tolerance — Ray 2.0.0.dev0 for more details!
