Best practice for custom actor recovery

This is a discussion of best practice, not an issue report.

I have seen this: Advanced pattern: Fault Tolerance with Actor Checkpointing — Ray v1.9.0

I am under the assumption that Ray has default restart behavior for actors that fail when a node in the cluster goes down – presumably it tries to reschedule those actors onto nodes that are still alive.

The tutorial linked here uses custom logic to handle loading state from checkpoints. This puts recovery on a different code path from normal execution, which is fragile. For example, I don't think the linked code would handle nested failures – i.e. a second machine failing immediately after the first failure has triggered the fault recovery mechanism.

A more stable approach might be to expose a handler method on the actor that is always called when Ray's automatic fault recovery reschedules the actor on a new machine. That handler could then initialize the necessary state and hook the actor back into the system. This removes the need for centralized code to handle fault recovery, which I believe is more stable.
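
To make this concrete, here is a minimal sketch of the pattern I have in mind. Note that on_restart is not an existing Ray hook – the idea is that Ray would call something like it automatically after rescheduling the actor – and the coordinator methods (latest_checkpoint, register_worker) are made up for illustration:

import ray

# Hypothetical sketch: `on_restart` is NOT a real Ray hook. It shows the kind
# of handler being proposed, i.e. something Ray would invoke automatically
# after the actor is rescheduled, so recovery logic lives on the actor itself.
@ray.remote(max_restarts=-1)
class Worker:
    def __init__(self, name, coordinator_name):
        self.name = name
        self.coordinator_name = coordinator_name
        self.state = {}

    def on_restart(self):
        # Reload state and hook this actor back into the system via a
        # long-lived named actor (its `latest_checkpoint` and
        # `register_worker` methods are assumptions for this example).
        coordinator = ray.get_actor(self.coordinator_name)
        self.state = ray.get(coordinator.latest_checkpoint.remote(self.name))
        coordinator.register_worker.remote(self.name)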

Ray experts, please share your thoughts on this.

In case you are interested, the use case is that I am trying to build an MPP database on Ray.

@marsupialtail I think you could achieve that for your use case via something like this:

import ray

@ray.remote(max_restarts=-1)  # restart this actor indefinitely if it dies
class MyActor:
    def __init__(self, ...):
        # True only when Ray has restarted this actor after a failure,
        # not on the first start.
        is_retry = ray.get_runtime_context().was_current_actor_reconstructed
        if is_retry:
             ...
        else:
             ...

In the is_retry branch you could fetch data from a named actor or another actor handle, etc., to restore global state.
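
For example, a minimal sketch of that suggestion – StateStore, put_state and get_state are illustrative names, not part of Ray's API:

import ray

@ray.remote
class StateStore:
    # Long-lived named actor that survives worker failures and holds the
    # state needed to re-initialize restarted actors.
    def __init__(self):
        self._state = {}

    def put_state(self, key, value):
        self._state[key] = value

    def get_state(self, key):
        return self._state.get(key)

@ray.remote(max_restarts=-1)
class MyActor:
    def __init__(self, key):
        self.key = key
        if ray.get_runtime_context().was_current_actor_reconstructed:
            # Restarted by Ray: recover state from the named actor.
            store = ray.get_actor("state_store")
            self.state = ray.get(store.get_state.remote(self.key))
        else:
            # First start: initialize fresh state.
            self.state = {}

# Usage: create the named store once (detached so it outlives its creator),
# then create the recoverable actor.
# store = StateStore.options(name="state_store", lifetime="detached").remote()
# actor = MyActor.remote("worker-0")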

cc @Stephanie_Wang and @Chen_Shen