Pending tasks/actors remain on the Ray cluster when the driver dies unexpectedly

Hi there,

I submitted a Ray job via a Python script from the terminal. I know that if I press Ctrl+C to kill the process, the pending tasks/actors will be removed from the Ray cluster (Killing driver does not kill tasks in Ray on minikube - #8 by Alex).

However, if the Python program exits unexpectedly, the remaining pending tasks/actors live on the Ray cluster forever. This also prevents the Ray cluster from ever scaling down.

Example code is shown below:

from time import sleep
import ray

ray.init()

@ray.remote
def go():
    sleep(60)  # simulate long-running work
    return "OK"

# Submit 50 tasks; then the driver crashes with an unhandled exception,
# leaving the still-pending tasks behind on the cluster.
objs = [go.remote() for _ in range(50)]
1 / 0

I also found some discussion on Stack Overflow: python - How to kill ray tasks when the driver is dead - Stack Overflow

Hey @Andrew_Li, good question. I am afraid Ray is not able to clean this up properly when the driver has died by exiting with an error.

You might have to reconnect to the cluster and manually clean up the actors/tasks there.
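For actors, a rough sketch of what that could look like (assuming the leaked actors were created with names; anonymous actors and pending tasks can't be looked up this way):

import ray

# Reconnect to the running cluster. "auto" assumes this runs on a node
# of the cluster -- otherwise pass the head node address explicitly.
ray.init(address="auto")

# Named actors can be recovered by name and killed explicitly.
for name in ray.util.list_named_actors():
    actor = ray.get_actor(name)
    ray.kill(actor)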

@sangcho Could you confirm this is the case, and do you know if there is anything in the Ray Jobs tooling that could deal with such unexpected failures in a cleaner way?
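I'm thinking of something along these lines: with the Ray Jobs API the driver runs as a managed job, so a dead or hung job can be stopped from outside, which should also cancel its remaining work. A minimal sketch, assuming the dashboard/Jobs API is reachable at http://127.0.0.1:8265 and the script is my_script.py (both placeholders):

from ray.job_submission import JobSubmissionClient

# Address of the Ray dashboard / Jobs API (adjust for your cluster).
client = JobSubmissionClient("http://127.0.0.1:8265")

# Run the driver script as a managed job instead of a bare terminal process.
submission_id = client.submit_job(entrypoint="python my_script.py")

# If the driver later dies or hangs, stopping the job from outside also
# cancels its remaining tasks and actors.
client.stop_job(submission_id)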

Thanks @rickyyx,

I would lose every Python ObjectRef since the driver exited with an error. How can I clean up the pending actors/tasks with ray.cancel() or ray.kill() after reconnecting to the cluster? Do you have any tips?

One more question: is it possible to re-create an ObjectRef from its string representation? e.g.:

@ray.remote
def go():
    ...

obj = go.remote()
# obj ---> ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)

# Hypothetical API -- does something like this exist?
new_obj = ray.create_objectref_from_string("c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000")
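The closest thing I could find is constructing ray.ObjectRef directly from the raw binary ID, though I'm not sure this is supported:

import ray

hex_id = "c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000"
# The ObjectRef constructor accepts the raw binary ID. This is an
# internal, undocumented constructor: the rebuilt ref may lack
# ownership info, so ray.get()/ray.cancel() on it can still fail.
new_obj = ray.ObjectRef(bytes.fromhex(hex_id))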