This is a discussion of best practice, not an issue report.
I have seen this tutorial: Advanced pattern: Fault Tolerance with Actor Checkpointing — Ray v1.9.0
I am under the assumption that Ray has built-in restart behavior for actors that fail when a node in the cluster goes down – presumably it tries to reschedule those actors onto nodes in the cluster that are still alive.
The tutorial linked above uses custom logic to load state from checkpoints. This makes the recovery code path different from the normal code path, and not very robust. For example, I don't think the linked code would handle nested failures – e.g. a second machine failing immediately after the first failure has already triggered the fault-recovery mechanism.
A more robust approach might be to expose a handler method on the actor that is always called when Ray's automatic fault-recovery mechanism reschedules the actor on a new machine. That handler could then initialize the necessary state and hook the actor back into the system. This removes the need for centralized fault-recovery code, which I believe is more robust. A rough sketch of what I mean is below.
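To make the idea concrete, here is a minimal sketch (illustrative only – names like `Shard`, `checkpoint_path`, and `_restore_from_checkpoint` are made up, and I'm assuming the checkpoint lives on storage every node can reach). As far as I understand, with `max_restarts` set Ray re-runs the actor's constructor when it restarts it on another node, so today the constructor is the closest thing to the "on restart" handler I'm describing:

```python
import json
import ray

@ray.remote(max_restarts=-1, max_task_retries=-1)
class Shard:
    def __init__(self, checkpoint_path: str):
        # My understanding: Ray re-runs __init__ when it restarts the actor
        # on another node, so the constructor doubles as the "on restart"
        # handler described above.
        self.checkpoint_path = checkpoint_path
        self.state = self._restore_from_checkpoint()

    def _restore_from_checkpoint(self):
        # Illustrative only: reload the last checkpoint if one exists,
        # otherwise start from empty state. The path would need to be on
        # storage reachable from every node (e.g. NFS or object storage).
        try:
            with open(self.checkpoint_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def checkpoint(self):
        # Periodically persist state so a restarted instance can catch up.
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.state, f)
```

With something like this, the cold-start path and the recovery path are the same code, so a nested failure just re-runs the same constructor again instead of going through a separate centralized recovery routine.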
Ray experts, please share your thoughts on this.
The use case, in case you are interested: I am trying to build an MPP database on Ray.