How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I have an Actor that is created it my driver script, it is the supervisor of other Actors that are dynamically placed into ActorPools. the supervisor is continuously managing the ActorPools, adding actors, removing actors, submitting work to actors, etc.
In my driver script I create the supervisor like this:
supervisor = Supervisor.options(name='supervisor').remote()
supervisor.execute_one_time_task.remote()
supervisor.execute_main_control_loop.remote() # this is the main loop that does the continuous planning of ActorPools
The problem arises when the Supervisor actor dies due to OOM error. The supervisor is restarted as expected but on restart only the code in the supervisor init method is executed.
My question is what pattern should I look at to make sure that the previously executing remote task that the actor was executing before it was killed is restarted?
If supervisor.execute_main_control_loop.remote() was running before the actor was killed how can I make sure that is restarted when the actor is revived by ray?
Is there a callback function that ray can execute on the actor when it is restarted by the cluster, in that function I could rebuild the state of my actor.
How are other teams handling such a situation?
Thank You.