This is a discussion of best practice, not an issue report.
I have seen this tutorial: Advanced pattern: Fault Tolerance with Actor Checkpointing — Ray v1.9.0
I am under the assumption that Ray has built-in restart behavior for actors that fail when a node in the cluster goes down – presumably it tries to reschedule those actors onto nodes in the cluster that are still alive.
The tutorial linked above uses custom logic to load state from checkpoints. This makes the recovery code path different from the normal code path, and not very robust. For example, I don't think the linked code would handle nested failures – e.g. a second machine failing immediately after the first failure has already triggered the fault-recovery mechanism.
A more robust approach might be to expose a handler method on the actor that is always called when Ray's automatic fault-recovery mechanism reschedules the actor on a new machine. That handler could then initialize the necessary state and hook the actor back into the system. This removes the need for centralized fault-recovery code, which I believe is more robust. A rough sketch of what I mean is below.
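To make the idea concrete, here is a minimal sketch (illustrative only – names like `Shard`, `checkpoint_path`, and `_restore_from_checkpoint` are made up, and I'm assuming the checkpoint lives on storage every node can reach). As far as I understand, with `max_restarts` set Ray re-runs the actor's constructor when it restarts it on another node, so today the constructor is the closest thing to the "on restart" handler I'm describing:

```python
import json
import ray

@ray.remote(max_restarts=-1, max_task_retries=-1)
class Shard:
    def __init__(self, checkpoint_path: str):
        # My understanding: Ray re-runs __init__ when it restarts the actor
        # on another node, so the constructor doubles as the "on restart"
        # handler described above.
        self.checkpoint_path = checkpoint_path
        self.state = self._restore_from_checkpoint()

    def _restore_from_checkpoint(self):
        # Illustrative only: reload the last checkpoint if one exists,
        # otherwise start from empty state. The path would need to be on
        # storage reachable from every node (e.g. NFS or object storage).
        try:
            with open(self.checkpoint_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def checkpoint(self):
        # Periodically persist state so a restarted instance can catch up.
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.state, f)
```

With something like this, the cold-start path and the recovery path are the same code, so a nested failure just re-runs the same constructor again instead of going through a separate centralized recovery routine.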
Ray experts, please share your thoughts on this.
The use case, in case you are interested: I am trying to build an MPP database on Ray.