Restarting a task that was running before an Actor was killed for OOM

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have an Actor that is created in my driver script; it is the supervisor of other Actors that are dynamically placed into ActorPools. The supervisor continuously manages the ActorPools: adding actors, removing actors, submitting work to actors, etc.

In my driver script I create the supervisor like this:

supervisor = Supervisor.options(name='supervisor').remote()  # create the named supervisor actor
supervisor.execute_one_time_task.remote()  # one-off setup work
supervisor.execute_main_control_loop.remote() # this is the main loop that does the continuous planning of ActorPools
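For context, a stripped-down version of the Supervisor looks roughly like the sketch below (the restart setting and method bodies are simplified here):

import ray

@ray.remote(max_restarts=-1)  # restarts enabled (simplified value) so Ray revives the actor after the OOM kill
class Supervisor:
    def __init__(self):
        # only this constructor runs again when Ray restarts the actor
        self.pools = {}

    def execute_one_time_task(self):
        # one-off setup work
        ...

    def execute_main_control_loop(self):
        # continuously plans the ActorPools: add actors, remove actors, submit work
        while True:
            ...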

The problem arises when the Supervisor actor dies due to an OOM error. The supervisor is restarted as expected, but on restart only the code in the supervisor's __init__ method is executed.

My question is: what pattern should I use to make sure that the remote task the actor was executing before it was killed is restarted as well?

If supervisor.execute_main_control_loop.remote() was running before the actor was killed, how can I make sure it is restarted when the actor is revived by Ray?

Is there a callback function that Ray can execute on the actor when it is restarted by the cluster? In that function I could rebuild the state of my actor.
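To make that concrete, the kind of pattern I have in mind is something like the sketch below: detect the restart inside __init__ and re-submit the control loop on the actor's own handle. I don't know whether this is safe or the recommended approach, hence the question:

import ray

@ray.remote(max_restarts=-1)
class Supervisor:
    def __init__(self):
        self.pools = {}
        ctx = ray.get_runtime_context()
        # if __init__ is running because Ray reconstructed the actor after the OOM kill,
        # rebuild state and re-submit the main control loop to ourselves
        if ctx.was_current_actor_reconstructed:
            self_handle = ctx.current_actor
            self_handle.execute_main_control_loop.remote()

    def execute_main_control_loop(self):
        while True:
            ...  # continuous ActorPool planning

My understanding is that the re-submitted call would only start after __init__ returns, since a (non-async) actor processes tasks one at a time, but I'd like confirmation that this is a reasonable pattern.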

How are other teams handling such a situation?

Thank You.

Have you tried packaging and submitting them as Ray Jobs instead?

I have not. I’m not sure I follow the suggestion.

@Sam_Chan Are you suggesting that instead of submitting the work to ActorPools, I have the Supervisor programmatically create jobs and submit those jobs to the cluster?
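i.e. something along the lines of the sketch below, where the Supervisor submits each unit of work through the job submission API instead of to an ActorPool? (The address and entrypoint script are just placeholders.)

from ray.job_submission import JobSubmissionClient

# connect to the cluster's job submission endpoint (placeholder address)
client = JobSubmissionClient("http://127.0.0.1:8265")

# submit one unit of work as its own Ray Job instead of sending it to an ActorPool
job_id = client.submit_job(
    entrypoint="python process_batch.py",  # hypothetical worker script
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))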