How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I am calling subprocess.run to run a Python script in isolation inside my remote function. One worker exceeded the memory threshold and was killed. However, the process started by subprocess.run was left orphaned on the machine and kept exceeding the memory threshold, so every queued task that was scheduled on this machine failed.
I looked for solutions or workarounds and tried to understand what was going on. I found the documentation page "Lifetimes of a User-Spawn Process" (Ray 2.40.0), but it is still not completely clear to me.
I understand that there are 2 environment variables: RAY_kill_child_processes_on_worker_exit (default true) and RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper (default false).
For my use case, where I am calling subprocess.run, the script seems to be a direct child process, so the first environment variable, which defaults to true, should apply. However, the documentation says “This won’t work if the worker crashed”, and I am not sure what “crashed” means here. The example on that documentation page contains the comment # sigkill'ed, the worker's subprocess killing no longer works:
import ray
import psutil
import subprocess
import time
import os

ray.init(_system_config={"kill_child_processes_on_worker_exit_with_raylet_subreaper":True})

@ray.remote
class MyActor:
    def __init__(self):
        pass

    def start(self):
        # Start a user process
        process = subprocess.Popen(["/bin/bash", "-c", "sleep 10000"])
        return process.pid

    def signal_my_pid(self):
        import signal
        os.kill(os.getpid(), signal.SIGKILL)

actor = MyActor.remote()
pid = ray.get(actor.start.remote())
assert psutil.pid_exists(pid)  # the subprocess running
actor.signal_my_pid.remote()  # sigkill'ed, the worker's subprocess killing no longer works
time.sleep(11)  # raylet kills orphans every 10s
assert not psutil.pid_exists(pid)
When Ray kills a worker because the memory threshold is exceeded, is the worker SIGKILLed and therefore considered crashed, so that I have to use the second environment variable? I would have said that this is an intentional kill and therefore not a crash. Happy to hear if I am missing something here.
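To make the distinction concrete, here is a minimal sketch that does not use Ray at all. It shows why a SIGKILLed process cannot clean up its children no matter who sent the signal: SIGKILL cannot be caught, so no exit handler runs. A parent that receives SIGTERM and handles it still gets the chance. The names PARENT_SRC, pid_alive and child_survives are made up for this illustration, and whether Ray's memory monitor actually uses SIGKILL is an assumption here, not something the documentation quoted above confirms.

```python
import os
import signal
import subprocess
import sys

# Source of a stand-in "worker": it spawns a long-lived child, registers
# cleanup code, reports the child's PID on stdout, then idles.
PARENT_SRC = r"""
import atexit, signal, subprocess, sys, time

child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def cleanup():
    child.kill()
    child.wait()  # reap the child so its PID really disappears

atexit.register(cleanup)  # runs on a normal interpreter exit only
# Turn SIGTERM into a normal exit so that the atexit handler runs.
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print(child.pid, flush=True)
time.sleep(60)
"""


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def child_survives(sig: int) -> bool:
    """Send `sig` to the stand-in worker and report whether its child outlives it."""
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC], stdout=subprocess.PIPE, text=True
    )
    child_pid = int(parent.stdout.readline())
    parent.send_signal(sig)
    parent.wait()
    alive = pid_alive(child_pid)
    if alive:
        os.kill(child_pid, signal.SIGKILL)  # do not leave the orphan behind
    parent.stdout.close()
    return alive


if __name__ == "__main__":
    print("child survives SIGTERM:", child_survives(signal.SIGTERM))
    print("child survives SIGKILL:", child_survives(signal.SIGKILL))
```

If the OOM kill is delivered as SIGKILL, the worker is in the second situation: from the point of view of child cleanup it behaves like a crash, even though the kill was intentional.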
Update: for what it’s worth, in my case even setting RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true does not kill the process launched with subprocess.run when the worker is killed due to an out-of-memory error.
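One possible Linux-only workaround, sketched here as an assumption rather than a confirmed fix for the Ray case, is to ask the kernel to signal the child when its parent dies, using prctl(PR_SET_PDEATHSIG). The helper names _die_with_parent and run_tied_to_parent are hypothetical. The sketch relies on subprocess's preexec_fn parameter, which the Python documentation warns is not safe in multi-threaded programs, and the parent-death signal is tied to the thread that started the child, so this may not behave as hoped inside a Ray worker.

```python
import ctypes
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>

# Load the C library before forking so the child only has to make the call.
_libc = ctypes.CDLL(None, use_errno=True)


def _die_with_parent() -> None:
    """Run in the child between fork() and exec().

    Asks the kernel to deliver SIGKILL to this process when the thread
    that created it terminates (Linux only).
    """
    if _libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGKILL), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def run_tied_to_parent(cmd, **kwargs) -> subprocess.CompletedProcess:
    """Like subprocess.run, but the child is killed if this process dies."""
    return subprocess.run(cmd, preexec_fn=_die_with_parent, **kwargs)


if __name__ == "__main__":
    result = run_tied_to_parent(
        [sys.executable, "-c", "print('ok')"], capture_output=True, text=True
    )
    print(result.returncode, result.stdout.strip())
```

Inside the remote function, the existing subprocess.run call would be replaced by run_tied_to_parent with the same arguments. The setting is cleared when the child executes a set-user-ID or set-group-ID binary, and it only covers the direct child, not processes that the script itself spawns.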