Hi,
Just started to dig into ray.io. Wanted to understand how the worker fault-tolerance behaves. If a worker is processing a remote call and suddenly disappears. What happens next?
thanks!
Hi,
Just started to dig into ray.io. Wanted to understand how the worker fault-tolerance behaves. If a worker is processing a remote call and suddenly disappears. What happens next?
thanks!
@radiantone: Ray will rerun the task until either the task succeeds or the maximum number of retries is exceeded. The default number of retries is 3. For the details, see ray doc
If that worker never comes back, does ray transparently find another worker to run the task? Or does it fail? The docs weren’t really clear on that. It says it will try to restart the actor, but what I’m looking for is to retry the task somewhere else. Perhaps a worker on another node.
You need to read ray doc further to get context on the task and actor and understand how ray handles and run them. I suggest, try running the examples to get better understanding.
Thanks for your response @mmuru ! One more thing is the actor fault tolerance is not enabled by default (but 3 times retry for the task is enabled by default). Please check Fault Tolerance — Ray 2.0.0.dev0 for more details!