I am trying to use checkpointing with actors: save an actor's state, restore it in another actor, and resume from where the first one left off.
Scenario: When a task is submitted, the work is distributed across multiple actors and runs in parallel. If an actor dies partway through, we get a RayActorError, and the tasks that were scheduled on that actor never execute because the actor was killed. I want to checkpoint each actor's state (including application state) so that when an actor dies, a new actor is started and continues the work that was running on the dead actor.
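To make the checkpointing part concrete, here is a rough sketch of what I have in mind. It is plain Python rather than Ray so that it runs standalone; the class name `CheckpointedWorker`, the `process` method, and the JSON checkpoint layout are all hypothetical and only illustrate the save/restore idea, not any Ray API.

```python
import json
import os
import tempfile


class CheckpointedWorker:
    """Plain-Python stand-in for an actor (hypothetical, not a Ray class)."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.processed = []  # application state: items handled so far
        # If an earlier worker left a checkpoint behind, resume from it.
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                self.processed = json.load(f)["processed"]

    def process(self, item):
        self.processed.append(item)
        self._save_checkpoint()
        return item

    def _save_checkpoint(self):
        # Write to a temporary file and rename it, so a crash in the
        # middle of a write does not leave a half-written checkpoint.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"processed": self.processed}, f)
        os.replace(tmp_path, self.checkpoint_path)


def demo_checkpoint_restore():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "worker_0.json")
        first = CheckpointedWorker(path)
        first.process("a")
        first.process("b")
        del first  # simulate the actor dying
        # The replacement picks up the dead worker's checkpoint file.
        replacement = CheckpointedWorker(path)
        replacement.process("c")
        return replacement.processed
```

Calling `demo_checkpoint_restore()` shows the replacement continuing with the state the first worker saved. In the real setup, the part I am unsure about is how the replacement actor learns which checkpoint path to load.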
Questions:
- Can we restart the killed actor itself, with the same ID?
- How can we identify which actor was killed, and point the newly started actor at the killed actor's checkpoint file?
- Where can we get the metadata of the tasks that were queued on the dead actor, so that we can tell the newly started actor to continue with them?
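As a fallback for the last two questions, the bookkeeping could be done on the driver side. Below is a minimal sketch of that idea, again in plain Python so it runs on its own. `WorkerDied` stands in for the RayActorError I see in practice, and `FlakyWorker`, `run_with_recovery`, and the `fail_on` parameter are hypothetical names used only to simulate a crash.

```python
class WorkerDied(Exception):
    """Hypothetical stand-in for the error raised when an actor dies."""


class FlakyWorker:
    """Simulated worker that dies when it receives the task `fail_on`."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on
        self.alive = True

    def run(self, task):
        if not self.alive or task == self.fail_on:
            self.alive = False
            raise WorkerDied(self.name)
        return f"{self.name}:{task}"


def run_with_recovery(assignments, fail_on=None):
    """`assignments` maps a worker name to the tasks queued on it."""
    results = {}
    for name, tasks in assignments.items():
        worker = FlakyWorker(name, fail_on=fail_on)
        pending = list(tasks)  # driver-side record of what is still queued
        while pending:
            task = pending[0]
            try:
                results[task] = worker.run(task)
                pending.pop(0)  # only forget a task once it has finished
            except WorkerDied:
                # The driver knows which worker died (`name`) and what it
                # still owed (`pending`), so it starts a replacement and
                # retries from the first unfinished task.
                worker = FlakyWorker(name + "-restarted")
    return results
```

For example, `run_with_recovery({"w0": ["t1", "t2"], "w1": ["t3", "t4"]}, fail_on="t3")` finishes `t3` and `t4` on the replacement worker. I would prefer not to maintain this mapping by hand if the framework already exposes it, which is the reason for the questions above.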