Fault tolerance with Ray actors

Hi all!
I am a bit confused regarding how a new Ray actor is reconstructed after a failure.

Imagine an actor which has a specific state, which might be the result of various tasks this actor executed. Now, this actor somehow fails, and Ray starts a new actor. Can the new actor somehow automatically (through a Ray mechanism) get the old state? Or is this a matter of checkpointing done by the application? Or is Ray going to execute again all tasks that generated this state?

Hi, great question! To save any internal state of the actor, you will need to do checkpointing at the application level. You can check the implementation of the Ray Serve controller actor for one example of this kind of checkpointing: ray/controller.py at master · ray-project/ray · GitHub

For details about how individual actor tasks are retried, see here : Fault Tolerance — Ray v2.0.0.dev0

1 Like