How severe does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I have a setup where I run multiple actors (with max_restarts = -1) in a Ray cluster. Sometimes the worker process that's running an actor exits due to some issue (I'm not sure what at the moment; it could be OOM or something else), and the GCS marks the actor for restart, but the restart hangs indefinitely. In the GCS logs I can see that the GCS keeps trying to restart actor A on worker W on node N, even though worker W is dead, and the worker's death is also recorded in the GCS log. So I think that somehow the GCS is not aware that worker W is dead and keeps trying to restart the actor on a dead worker.
My question is: can this happen? Why does the GCS try to restart the actor on a dead worker?
Thank you for replying, @yic. I don’t yet have a minimal reproducible script, but I will work on producing one. In the meantime, can you help answer these questions:
1. Does ray.kill also kill the CoreWorker process that the actor is attached to on its node?
2. Does the raylet on a worker node control which CoreWorkers are alive on that node, and is this information communicated to the GCS or to the head raylet (in a cluster setting)?
3. If a worker node has no CoreWorker processes running (maybe they were killed by ray.kill, if #1 is true) and I try to create a new actor on that node, should the GCS first communicate with the raylet on the worker node to create a CoreWorker process to attach the actor to?
Does ray.kill also kill the CoreWorker process that the actor is attached to on its node?
Yes
Does the raylet on a worker node control which CoreWorkers are alive on that node, and is this information communicated to the GCS or to the head raylet (in a cluster setting)?
We don’t have a head raylet concept. The information is pushed to the GCS.
If a worker node has no CoreWorker processes running (maybe they were killed by ray.kill, if #1 is true) and I try to create a new actor on that node, should the GCS first communicate with the raylet on the worker node to create a CoreWorker process to attach the actor to?
Yes
A CoreWorker is just the place where work runs. If there is no such process on the node, the raylet will create one.
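To make the three answers concrete, here is a toy model in plain Python. It is not Ray code and does not reflect Ray's internals: `ToyGcs`, `ToyRaylet`, and every method name are hypothetical, invented only to illustrate the flow described above (the raylet owns worker processes, pushes worker deaths to the GCS, and starts a new worker when an actor needs one).

```python
from itertools import count


class ToyGcs:
    """Toy stand-in for the GCS: records actor placement and dead workers."""

    def __init__(self):
        self.actor_to_worker = {}
        self.dead_workers = set()

    def on_worker_death(self, worker_id):
        # The raylet pushes this notification; the toy GCS does not poll.
        self.dead_workers.add(worker_id)

    def schedule_actor(self, actor_name, raylet):
        # Ask the node's raylet for a live worker to host the actor.
        worker_id = raylet.lease_worker()
        self.actor_to_worker[actor_name] = worker_id
        return worker_id


class ToyRaylet:
    """Toy stand-in for a per-node raylet that owns worker processes."""

    def __init__(self, node_id, gcs):
        self.node_id = node_id
        self.gcs = gcs
        self.idle_workers = []
        self.live_workers = set()
        self._next_id = count(1)

    def lease_worker(self):
        # Start a worker on demand when none is idle on this node.
        if not self.idle_workers:
            worker_id = f"{self.node_id}-w{next(self._next_id)}"
            self.live_workers.add(worker_id)
            self.idle_workers.append(worker_id)
        return self.idle_workers.pop()

    def kill_worker(self, worker_id):
        # Models ray.kill or a crash: drop the worker, then tell the GCS.
        self.live_workers.discard(worker_id)
        if worker_id in self.idle_workers:
            self.idle_workers.remove(worker_id)
        self.gcs.on_worker_death(worker_id)


gcs = ToyGcs()
raylet = ToyRaylet("N", gcs)

first = gcs.schedule_actor("A", raylet)   # node has no worker, so one is started
raylet.kill_worker(first)                 # worker dies; the death is pushed to the GCS
second = gcs.schedule_actor("A", raylet)  # restart lands on a freshly started worker

print(first, second)  # N-w1 N-w2
```

In this model a restart can never target the dead worker, because the worker ID always comes from the raylet at scheduling time. If your GCS logs show repeated restart attempts against a worker that is already recorded as dead, that is worth capturing in the reproduction script.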