Pending tasks/actors remain on the Ray cluster when the driver dies unexpectedly

Hi there,

I submitted a Ray job via a Python script on the terminal. I know that if I press Ctrl+C to kill the process, the pending tasks/actors will be removed from the Ray cluster (Killing driver does not kill tasks in Ray on minikube - #8 by Alex).

However, if the Python program exits unexpectedly, the remaining pending tasks/actors will live on the Ray cluster forever. This also prevents the Ray cluster from ever scaling down.

Example code is shown below:

from time import sleep
import ray


@ray.remote
def go():
    sleep(10)
    1 / 0  # raises ZeroDivisionError to simulate an unexpected task failure
    return "OK"


# ray.get() re-raises the task error, so the driver exits while other tasks are still queued.
objs = [go.remote() for _ in range(50)]
print(ray.get(objs))

I also found some discussion of this on Stack Overflow: python - How to kill ray tasks when the driver is dead - Stack Overflow

Hey @Andrew_Li, good question. I am afraid Ray is not able to clean this up properly when the driver exits with an error.

You might have to reconnect to the cluster and manually clean up the actors/tasks there.
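
For example, something like this might work (an untested sketch, assuming a Ray 2.x cluster with the ray.util.state API available, and that any leaked actors were created with names so they can be looked up again):

import ray
from ray.util.state import list_actors, list_tasks

# Reconnect to the running cluster from a fresh driver.
ray.init(address="auto")

# Named actors can be re-fetched by name and killed explicitly.
for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    if actor.name:
        handle = ray.get_actor(actor.name, namespace=actor.ray_namespace)
        ray.kill(handle)

# Pending tasks can at least be inspected; ray.cancel() still needs the
# original ObjectRef, which is gone with the dead driver.
for task in list_tasks():
    print(task.task_id, task.name, task.state)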

@sangcho Could you confirm this is the case, and do you know if there is anything in the Ray Job tooling that could handle such an unexpected failure in a cleaner way?

Thanks @rickyyx,

Since the driver exited in error, I have lost every Python ObjectRef. How can I clean up the pending actors/tasks with ray.cancel() or ray.kill() after reconnecting to the cluster? Do you have any tips?

One more question: is it possible to re-create an ObjectRef from its hex string? For example:

@ray.remote
def go():
    ...


obj = go.remote()

# obj ---> ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)

# (hypothetical API) re-create the ObjectRef from the printed hex string
new_obj = ray.create_objectref_from_string("c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000")
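
The closest real thing I could find is that obj.hex() produces that string and that ray.ObjectRef() accepts the raw binary ID, so maybe something like the sketch below would work. This is only a guess on my part, and I suspect the object would not be retrievable anyway once its owner (the dead driver) is gone:

import ray

hex_id = "c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000"
# Speculative: rebuild an ObjectRef from its binary ID.
new_obj = ray.ObjectRef(bytes.fromhex(hex_id))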