I submitted a Ray job via a Python script on the terminal. I know that if I press Ctrl+C to kill the process, the pending tasks/actors will be removed from the Ray cluster (Killing driver does not kill tasks in Ray on minikube - #8 by Alex).
However, if the Python program exits unexpectedly, the remaining pending tasks/actors will live on the Ray cluster forever. This also prevents the Ray cluster from ever scaling down.
Example code as shown (the driver crashes after submitting 50 long-running tasks):

import ray
from time import sleep

@ray.remote
def go():
    sleep(1000)  # long-running task

objs = [go.remote() for _ in range(50)]
1 / 0  # unexpected error; the driver exits with the 50 tasks still pending
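The leaked work is still visible if you reconnect from a new driver. A minimal sketch, assuming Ray 2.x with the state API (ray.util.state) available:

import ray
from ray.util.state import list_tasks

ray.init(address="auto")  # reconnect to the running cluster

# Tasks submitted by the dead driver are still tracked by the cluster.
for task in list_tasks():
    print(task.name, task.state)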
I also found a related discussion on Stack Overflow: How to kill ray tasks when the driver is dead.
Hey @Andrew_Li, good question. I am afraid Ray is not able to clean this up properly when the driver exits in error.
You might have to reconnect to the cluster and manually clean up the actors/tasks there.
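A minimal sketch of that manual cleanup, assuming Ray 2.x with the state API. Note that ray.kill() needs an actor handle, and a new driver can only re-acquire handles to named actors; pending tasks cannot be ray.cancel()-ed without their ObjectRefs:

import ray
from ray.util.state import list_actors

ray.init(address="auto")  # reconnect to the cluster

# Kill actors that are still alive. Only *named* actors can be
# re-acquired as handles from a new driver via ray.get_actor().
for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    if actor.name:
        handle = ray.get_actor(actor.name, namespace=actor.ray_namespace)
        ray.kill(handle)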
@sangcho Could you confirm this is the case? And do you know if there is anything in the Ray job tooling that could handle such unexpected failures in a cleaner way?
Thanks @rickyyx,
I would lose every Python ObjectRef since the driver exited in error. How can I clean up the pending actors/tasks with ray.kill() after reconnecting to the cluster? Do you have any tips?
One more question: is it possible to re-create an ObjectRef from its hex string? e.g.:
obj = go.remote()
# obj ---> ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)
new_obj = ray.create_objectref_from_string("c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000")
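For what it's worth, there is no public create_objectref_from_string in Ray. The closest known workaround is the internal ObjectRef binary constructor; a hedged sketch (unsupported, and it would not help here, since an object dies with the driver that owns it):

import ray

hex_id = "c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000"
# ray.ObjectRef() accepting the raw binary id is internal and unsupported;
# the rebuilt ref is only usable while the owning driver is still alive.
new_obj = ray.ObjectRef(bytes.fromhex(hex_id))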