Restarting a task that was running before an Actor was killed for OOM

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have an Actor that is created in my driver script; it is the supervisor of other Actors that are dynamically placed into ActorPools. The supervisor continuously manages the ActorPools: adding actors, removing actors, submitting work to actors, etc.

In my driver script I create the supervisor like this:

supervisor = Supervisor.options(name='supervisor').remote()  # create the named supervisor actor
supervisor.execute_one_time_task.remote()  # one-off setup work
supervisor.execute_main_control_loop.remote() # this is the main loop that does the continuous planning of ActorPools
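For context, a stripped-down version of the Supervisor looks roughly like the sketch below (the restart setting and method bodies are simplified here):

import ray

@ray.remote(max_restarts=-1)  # restarts enabled (simplified value) so Ray revives the actor after the OOM kill
class Supervisor:
    def __init__(self):
        # only this constructor runs again when Ray restarts the actor
        self.pools = {}

    def execute_one_time_task(self):
        # one-off setup work
        ...

    def execute_main_control_loop(self):
        # continuously plans the ActorPools: add actors, remove actors, submit work
        while True:
            ...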

The problem arises when the Supervisor actor dies due to an OOM error. The supervisor is restarted as expected, but on restart only the code in the supervisor's __init__ method is executed.

My question is: what pattern should I use to make sure that the remote task the actor was executing before it was killed is restarted as well?

If supervisor.execute_main_control_loop.remote() was running before the actor was killed, how can I make sure it is restarted when the actor is revived by Ray?

Is there a callback function that Ray can execute on the actor when it is restarted by the cluster? In that function I could rebuild the state of my actor.
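To make that concrete, the kind of pattern I have in mind is something like the sketch below: detect the restart inside __init__ and re-submit the control loop on the actor's own handle. I don't know whether this is safe or the recommended approach, hence the question:

import ray

@ray.remote(max_restarts=-1)
class Supervisor:
    def __init__(self):
        self.pools = {}
        ctx = ray.get_runtime_context()
        # if __init__ is running because Ray reconstructed the actor after the OOM kill,
        # rebuild state and re-submit the main control loop to ourselves
        if ctx.was_current_actor_reconstructed:
            self_handle = ctx.current_actor
            self_handle.execute_main_control_loop.remote()

    def execute_main_control_loop(self):
        while True:
            ...  # continuous ActorPool planning

My understanding is that the re-submitted call would only start after __init__ returns, since a (non-async) actor processes tasks one at a time, but I'd like confirmation that this is a reasonable pattern.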

How are other teams handling such a situation?

Thank You.

Have you tried packaging and submitting them as Ray Jobs instead?

I have not. I’m not sure I follow the suggestion.

@Sam_Chan Are you suggesting that instead of submitting the work to ActorPools, I have the Supervisor programmatically create jobs and submit those jobs to the cluster?
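i.e. something along the lines of the sketch below, where the Supervisor submits each unit of work through the job submission API instead of to an ActorPool? (The address and entrypoint script are just placeholders.)

from ray.job_submission import JobSubmissionClient

# connect to the cluster's job submission endpoint (placeholder address)
client = JobSubmissionClient("http://127.0.0.1:8265")

# submit one unit of work as its own Ray Job instead of sending it to an ActorPool
job_id = client.submit_job(
    entrypoint="python process_batch.py",  # hypothetical worker script
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))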