Fault tolerance with Ray actors

Fot · June 22, 2021, 1:42pm

Hi all!
I am a bit confused regarding how a new Ray actor is reconstructed after a failure.

Imagine an actor with some state that is the result of the tasks it has executed. Suppose this actor fails and Ray starts a new one. Can the new actor automatically recover the old state through some Ray mechanism? Or does the application have to handle this with checkpointing? Or will Ray re-execute all the tasks that produced this state?

architkulkarni · June 23, 2021, 11:57pm

Hi, great question! To save any internal state of the actor, you will need to do checkpointing at the application level. You can check the implementation of the Ray Serve controller actor for one example of this kind of checkpointing: ray/controller.py at master · ray-project/ray · GitHub
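
As a rough illustration of application-level checkpointing (a minimal sketch, not the Serve controller's actual implementation): the class below saves its state to a JSON file after every update and reloads it in `__init__`. Ray reruns the constructor when it restarts an actor, so the constructor is a natural place to restore. The names `CheckpointedCounter`, `checkpoint_path`, and `run_with_ray` are made up for this example. A local file also assumes the actor restarts on a node that can see that path; shared or external storage would be needed otherwise.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Hypothetical actor class: keeps a running total and persists it."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.total = 0
        # The constructor runs again after a restart, so restore state here.
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                self.total = json.load(f)["total"]

    def add(self, amount):
        self.total += amount
        self._save()
        return self.total

    def _save(self):
        # Write to a temporary file and rename it into place, so a crash
        # in the middle of a write does not leave a partial checkpoint.
        directory = os.path.dirname(self.checkpoint_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"total": self.total}, f)
        os.replace(tmp_path, self.checkpoint_path)


def run_with_ray(checkpoint_path):
    """Wrap the class as a restartable Ray actor (requires Ray installed)."""
    import ray

    ray.init(ignore_reinit_error=True)
    # max_restarts=-1: restart the actor indefinitely after failures.
    # max_task_retries=-1: keep retrying tasks interrupted by a failure.
    Counter = ray.remote(max_restarts=-1, max_task_retries=-1)(CheckpointedCounter)
    counter = Counter.remote(checkpoint_path)
    return ray.get(counter.add.remote(5))
```

One thing to keep in mind with this sketch: when task retries are enabled, a task that saved its checkpoint just before the actor died may run again after the restart, so `add` could be applied twice. If that matters, the update would need to be made idempotent, for example by tracking a request ID in the checkpoint.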

For details about how individual actor tasks are retried, see: Fault Tolerance — Ray v2.0.0.dev0
