Newbie Question: Worker Fault Tolerance?

I just started to dig into Ray and wanted to understand how worker fault tolerance behaves. If a worker is processing a remote call and suddenly disappears, what happens next?


@radiantone: Ray will rerun the task until either the task succeeds or the maximum number of retries is exceeded. The default number of retries is 3. For details, see the Ray docs.

If that worker never comes back, does Ray transparently find another worker to run the task, or does it fail? The docs weren’t really clear on that. They say Ray will try to restart the actor, but what I’m looking for is for the task to be retried somewhere else, perhaps on a worker on another node.

You need to read the Ray docs further to get context on tasks and actors and to understand how Ray handles and runs them. I suggest running the examples to get a better understanding.

  1. A Ray task can run on any available worker that meets its resource requirements, so a retry is not tied to the worker that died. If the retries keep failing, the task fails once max_retries (3 by default) is exhausted.
  2. A Ray actor and its tasks run only on the actor's dedicated worker. On failure, Ray tries to restart the actor's worker on any available node that meets the resource requirements, subject to the placement group strategy, and reruns the actor's tasks with either at-most-once or at-least-once semantics.
    @yic and @sangcho: Please chime in if I missed anything.

Thanks for your response, @mmuru! One more thing: actor fault tolerance is not enabled by default (although the 3 retries for tasks are enabled by default). Please check Fault Tolerance — Ray 2.0.0.dev0 for more details!