Actor restart is hanging because GCS cannot schedule the actor on a worker that's exited

cem · June 17, 2023, 12:12am

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have a setup where I run multiple actors (with max_restarts = -1) in a Ray cluster. Sometimes the worker process that's running an actor exits due to some issue (I'm not sure yet; it could be OOM or something else). The GCS marks the actor to be restarted, but the restart hangs indefinitely. In the GCS logs I can see that GCS keeps trying to restart actor A on worker W on node N, even though worker W is dead and its death is also recorded in the GCS log. So I think that somehow GCS is not aware that worker W is dead and keeps trying to restart the actor on a dead worker.

My question is can this happen? Why does GCS try to restart the actor on a dead worker?
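For anyone trying to reproduce the setup described above, here is a minimal sketch, not a confirmed reproduction of the hang. It assumes a local `ray.init()` and simulates the unexpected worker exit with `os._exit(1)`; the names `RestartableWorker` and `run_repro` and the 60-second timeout are made up for illustration. `ray` is imported inside the function so the plain class can be exercised without a cluster.

```python
import os


class RestartableWorker:
    """Plain class; it is wrapped as a Ray actor inside run_repro()."""

    def ping(self) -> int:
        # Return the PID of the process that hosts this object.
        return os.getpid()

    def crash(self) -> None:
        # Simulate an unexpected worker exit (stand-in for OOM or similar).
        os._exit(1)


def run_repro(timeout_s: float = 60.0) -> None:
    # Imported lazily: only this function needs Ray.
    import ray
    from ray.exceptions import GetTimeoutError, RayActorError

    ray.init()
    try:
        # max_restarts=-1 asks Ray to restart the actor indefinitely.
        actor_cls = ray.remote(max_restarts=-1)(RestartableWorker)
        actor = actor_cls.remote()
        print("pid before crash:", ray.get(actor.ping.remote()))

        try:
            ray.get(actor.crash.remote())
        except RayActorError:
            # Expected: the in-flight call fails when the worker process dies.
            pass

        try:
            pid = ray.get(actor.ping.remote(), timeout=timeout_s)
            print("pid after restart:", pid)
        except GetTimeoutError:
            print(f"no reply within {timeout_s}s; the restart may be hanging")
    finally:
        ray.shutdown()
```

Calling `run_repro()` on a healthy cluster should normally print two PIDs. A timeout here would only suggest, not prove, the behavior described in this post.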

yic · June 21, 2023, 8:52pm

Hi @cem, this seems like a bug. Do you have a minimal reproducible script for this?

The correct workflow should be:

1. GCS removes that node.
2. GCS starts rescheduling the actor on another node.
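The two steps above can be pictured with a toy model in plain Python. This is not Ray's GCS code; `ToyGcs`, its fields, and the "pick the first alive node" policy are all hypothetical and exist only to show the intended order of operations: drop the dead node first, then place its actors on a node that is still alive.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyGcs:
    """Hypothetical stand-in for the scheduler's bookkeeping, not Ray internals."""

    alive_nodes: set[str]
    actor_location: dict[str, str] = field(default_factory=dict)

    def on_node_dead(self, node_id: str) -> None:
        # Step 1: remove the dead node from the set of schedulable nodes.
        self.alive_nodes.discard(node_id)
        # Step 2: reschedule every actor that lived on it onto another node.
        orphaned = [a for a, n in self.actor_location.items() if n == node_id]
        for actor_id in orphaned:
            self.actor_location[actor_id] = self._pick_node()

    def _pick_node(self) -> str:
        if not self.alive_nodes:
            raise RuntimeError("no alive node to schedule on")
        # Toy policy: deterministic choice of the first alive node by name.
        return sorted(self.alive_nodes)[0]
```

In this model the hang described in the question would correspond to skipping step 1, so that the dead location stays eligible for placement.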

cemk · June 23, 2023, 9:13pm

Thank you for replying, @yic. I don’t have a minimal reproducible script yet, but I will work on producing one. In the meantime, can you help answer these questions:

1. ray.kill does this also kill the CoreWorker process running in the node that the actor is attached to?
2. Also does the worker node raylet control which CoreWorkers are alive on a given worker node and is this information communicated to the GCS or the head raylet (in a cluster setting)?
3. If a worker node has no CoreWorker processes running maybe they were killed by ray.kill (if #1 is true) and if I try to create a new actor on that node with no CoreWorker processes running should GCS first communicate with the raylet on the worker node to create CoreWorker process to attach the actor onto?

yic · June 26, 2023, 9:44pm

ray.kill does this also kill the CoreWorker process running in the node that the actor is attached to?

Yes

Also does the worker node raylet control which CoreWorkers are alive on a given worker node and is this information communicated to the GCS or the head raylet (in a cluster setting)?

We don’t have a head raylet concept. The information is pushed to GCS.

If a worker node has no CoreWorker processes running maybe they were killed by ray.kill (if #1 is true) and if I try to create a new actor on that node with no CoreWorker processes running should GCS first communicate with the raylet on the worker node to create CoreWorker process to attach the actor onto?

Yes

CoreWorker is just the place to run work. If there is no such process on the node, it’ll be created by the raylet.
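One way to observe this from user code is to compare the actor's PID before and after `ray.kill`. The sketch below assumes a local `ray.init()`; `PidReporter` and `show_kill_and_restart` are illustrative names. It passes `no_restart=False` so the actor is allowed to restart, and sets `max_task_retries=-1` so a call that races with the kill is retried rather than failing (safe here because `pid()` has no side effects).

```python
from __future__ import annotations

import os


class PidReporter:
    """Plain class; it is wrapped as a Ray actor inside show_kill_and_restart()."""

    def pid(self) -> int:
        return os.getpid()


def show_kill_and_restart(timeout_s: float = 60.0) -> tuple[int, int]:
    # Imported lazily: only this function needs Ray.
    import ray

    ray.init()
    try:
        actor_cls = ray.remote(max_restarts=-1, max_task_retries=-1)(PidReporter)
        actor = actor_cls.remote()
        old_pid = ray.get(actor.pid.remote())

        # Kill the worker process hosting the actor but allow a restart.
        ray.kill(actor, no_restart=False)

        # The next call waits until the actor is running again.
        new_pid = ray.get(actor.pid.remote(), timeout=timeout_s)
        return old_pid, new_pid
    finally:
        ray.shutdown()
```

If the two PIDs differ, the restarted actor is running in a fresh worker process, which matches the explanation above.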

cem · June 26, 2023, 10:01pm

thanks @yic for clarifying, this is very helpful
