However, if the Python program exits unexpectedly, the remaining pending tasks/actors will live on the Ray cluster forever. This also prevents the Ray cluster from ever scaling down.
Example code:
from time import sleep

import ray

@ray.remote
def go():
    sleep(10)
    1 / 0  # each task fails with a ZeroDivisionError after 10 seconds
    return "OK"

objs = [go.remote() for _ in range(50)]
print(ray.get(objs))
Hey @Andrew_Li, good question. I am afraid Ray is not able to clean this up properly when the driver exited in error and is gone.
You might have to reconnect to the cluster and manually clean up the actors/tasks there.
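For reference, a minimal sketch of that manual cleanup, assuming the leaked actors were created as named (e.g. detached) actors; the address and actor names below are placeholders. Pending tasks for which you no longer hold an ObjectRef cannot be cancelled this way.

import ray

# Reconnect to the existing cluster (placeholder address).
ray.init(address="ray://<head-node-host>:10001")

# Named actors can be looked up again and killed explicitly.
for name in ["ingest_pool", "cache"]:  # hypothetical actor names
    try:
        actor = ray.get_actor(name)
        ray.kill(actor)
    except ValueError:
        # No actor is registered under this name.
        pass

ray.shutdown()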
@sangcho Could you confirm this is the case, and do you know if there is anything in the Ray job tooling that we could use to deal with such unexpected failures in a cleaner way?
I would lose every Python ObjectRef since the driver exited in error. How can I clean up the pending actors/tasks with ray.cancel() or ray.kill() after reconnecting to the cluster? Do you have any tips?
One more question: is it possible to re-create the ObjectRef from its string representation? e.g.:
No, you cannot re-create the ObjectRef from a string.
Actually, when I look at your example code, why would the driver exit unexpectedly? 1 / 0 happens inside a remote function, so it will only crash the worker processes.
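For what it's worth, the task failure does still surface in the driver: ray.get re-raises the remote exception wrapped in a RayTaskError, so the driver only exits in error if that exception is left uncaught. A minimal sketch of catching it so the driver can shut down cleanly:

import ray
from ray.exceptions import RayTaskError

ray.init()

@ray.remote
def go():
    1 / 0  # fails inside the worker process

try:
    ray.get(go.remote())
except RayTaskError as e:
    # The worker's ZeroDivisionError is wrapped and re-raised here;
    # the driver can log it and shut down instead of crashing.
    print(f"task failed: {e}")
finally:
    ray.shutdown()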
I'm a bit late to the party, but I'm having a similar issue. How exactly do you clean up tasks when you've lost the object_ref? I can use the CLI to discover the TASK_ID (it's stuck in scheduled), but as far as I can tell there isn't a way to kill the task. Am I supposed to manually destroy the cluster whenever I get an orphaned task?
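For reference, here is roughly how stuck tasks can be inspected from Python with the state API; this assumes a Ray version that ships it (it lived under ray.experimental.state.api around 2.2 and moved to ray.util.state in later releases), and it only lists the tasks, it does not offer a way to cancel them without the original ObjectRef:

# Assumes Ray >= 2.2; in newer releases import from ray.util.state instead.
from ray.experimental.state.api import list_tasks

# List every task that has not finished (e.g. stuck in scheduling).
for task in list_tasks(filters=[("state", "!=", "FINISHED")]):
    print(task)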
In my case, the failed tasks were orphaned when the following happened:
The autoscaler ran into an Azure resource limit and failed to scale from 0 → n workers
Network connectivity from outside the cluster was interrupted, so the external ray submit cluster.yaml test.py disconnected
Ran into the same issue; this is somewhat problematic from a sysadmin point of view. Are the users expected to destroy and bring up the cluster themselves?
They are expected to be cleaned up automatically. If not, that's a bug we should fix.
@idantene @joshv how did you guys verify there was a leak? Is it from the output of ray status?
Actually, I could easily reproduce it with:
import ray

ray.init()

@ray.remote
def f():
    import time
    time.sleep(300)

refs = [f.remote() for _ in range(30000)]
ray.get(refs)
And then Ctrl+C → call ray status. Interestingly, when I checked the raylet, it doesn't have any pending requests. So I think it is a reporting problem (but the tasks are actually not leaked).
Let me create an issue and we will prioritize fixing it.
My current workaround is to disable any CTRL+C in the code and maintain a list of tasks that may need to be cancelled:
import atexit
import logging
import signal

import ray
from ray.exceptions import TaskCancelledError

logger = logging.getLogger(__name__)


def cleanup_ray(task_list):
    """Shuts the Ray connection down and cancels any outstanding tasks"""

    def _noop(signum, frame):
        return

    # Do not allow CTRL+C while cleaning up
    signal.signal(signal.SIGINT, _noop)

    def _cleanup():
        logger.info("Cleaning up Ray runtime...")
        for task in task_list:
            try:
                ray.cancel(task, force=True)
            except TaskCancelledError:
                continue
        ray.shutdown()

    return _cleanup


# Elsewhere in the code
# Task references are pushed/popped to/from this list as
# they're created and completed
tasks_to_cancel = list()
ray.init(address=f"ray://{RAY_HEAD_NODE}")
atexit.register(cleanup_ray(tasks_to_cancel))
Hi @idantene, as @sangcho mentioned, it should not be necessary to handle Ctrl+C or task cleanup yourself, as Ray should do this for you (and if not it's a bug). It may happen asynchronously, but any tasks or objects should get cleaned up within seconds.
I tried this myself on the script that you provided by killing the driver and confirmed that all of the workers do get cleaned up. But if you have more info on how to reproduce what you're seeing, that would be great. Thanks!
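A rough sketch of one way to verify this, assuming you can still reach the head node (the address below is a placeholder): reconnect from a fresh driver and watch the cluster's used CPUs drain back to zero after the original driver is killed.

import time

import ray

# Reconnect from a new driver (placeholder address).
ray.init(address="ray://<head-node-host>:10001")

for _ in range(10):
    total = ray.cluster_resources().get("CPU", 0)
    free = ray.available_resources().get("CPU", 0)
    print(f"CPU in use: {total - free:.1f} / {total:.1f}")
    time.sleep(5)

ray.shutdown()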
Just re-quoting how we discovered the issue during some early testing, in case it got missed in the earlier discussion. To expand on the resource limit issue: the specific problem was that we had a hard limit on public IPs and the autoscaler failed to scale from 0 → N workers.
Btw, for everyone here, how did you figure out that "resources were leaked"? Is it from the output of ray status, or something else (like leaking processes)?
There was a bug where pending tasks would not be cleared when the driver exits. For example, my cluster has the following output from "ray status" that shows the pending tasks/actors still consuming resources:
======== Autoscaler status: 2023-02-02 00:32:46.661892 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_891a298e04bed95c7331750fd88031b6db3ac1f0f07a222bda7e1ca1
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 16.0/16.0 CPU
 0.00/27.206 GiB memory
 0.00/2.000 GiB object_store_memory
The pending tasks/actors would stay even after the driver that submitted them died.
This is fixed now, and the fix will be released in the upcoming Ray 2.3 release, which should be available some time in mid-February.