Understanding "Lifetimes of a User-Spawn Process"

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am calling subprocess.run to run a Python script in isolation inside my remote function. One worker exceeded the memory threshold and was killed. However, the process started by subprocess.run was left orphaned on the machine and kept exceeding the memory threshold, so every queued task that was scheduled on this machine failed.
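One possible OS-level workaround for this kind of orphaning, independent of Ray's own settings, is sketched below. It is Linux-only and rests on assumptions: it calls prctl(PR_SET_PDEATHSIG) through ctypes so that the kernel sends SIGKILL to the child when the thread that spawned it dies, even if that parent was itself SIGKILLed. The helper name run_isolated is hypothetical, and this has not been verified inside a Ray worker.

```python
import ctypes
import os
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>


def run_isolated(cmd: list[str], **kwargs) -> subprocess.CompletedProcess:
    """Hypothetical helper: like subprocess.run, but on Linux the kernel
    SIGKILLs the child if the parent thread that spawned it dies."""
    if not sys.platform.startswith("linux"):
        # No parent-death signal on this platform; fall back to plain run.
        return subprocess.run(cmd, **kwargs)

    libc = ctypes.CDLL(None, use_errno=True)  # resolve prctl before forking
    parent_pid = os.getpid()

    def set_pdeathsig() -> None:
        # Runs in the child, between fork() and exec().
        if libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGKILL)) != 0:
            raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
        # Close the race where the parent died before prctl took effect.
        if os.getppid() != parent_pid:
            os._exit(1)

    return subprocess.run(cmd, preexec_fn=set_pdeathsig, **kwargs)


if __name__ == "__main__":
    result = run_isolated([sys.executable, "-c", "print('hello from child')"])
    print("exit code:", result.returncode)
```

Inside the remote function, run_isolated([...]) would take the place of the direct subprocess.run call. Two caveats apply: the parent-death signal is tied to the spawning thread rather than to the whole process, and the Python documentation warns that preexec_fn is not safe in multi-threaded programs, so this is something to experiment with and not a guaranteed fix.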

I looked for solutions or workarounds and tried to understand what was going on. I found the documentation page Lifetimes of a User-Spawn Process — Ray 2.40.0, but it is still not completely clear to me.

I understand that there are 2 environment variables: RAY_kill_child_processes_on_worker_exit (default true) and RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper (default false).

For my use case, where I call subprocess.run, it seems to me that the script runs as a direct child process, so the first environment variable, which defaults to true, should apply. However, the documentation says “This won’t work if the worker crashed”, and I am not sure what “crashed” means here. The example on that documentation page contains the comment # sigkill'ed, the worker's subprocess killing no longer works.

import ray
import psutil
import subprocess
import time
import os

ray.init(_system_config={"kill_child_processes_on_worker_exit_with_raylet_subreaper":True})

@ray.remote
class MyActor:
  def __init__(self):
    pass

  def start(self):
    # Start a user process
    process = subprocess.Popen(["/bin/bash", "-c", "sleep 10000"])
    return process.pid

  def signal_my_pid(self):
    import signal
    os.kill(os.getpid(), signal.SIGKILL)


actor = MyActor.remote()

pid = ray.get(actor.start.remote())
assert psutil.pid_exists(pid)  # the subprocess running

actor.signal_my_pid.remote()  # sigkill'ed, the worker's subprocess killing no longer works
time.sleep(11)  # raylet kills orphans every 10s
assert not psutil.pid_exists(pid)

When Ray kills a worker because the memory threshold is exceeded, is the worker sigkill’ed and thus considered crashed, so that I have to use the second environment variable? I would have said that this is an intentional kill and therefore not a crash. Happy to learn if I am missing something here.
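The general OS behavior behind this question can be illustrated without Ray. Below is a minimal, stdlib-only sketch for a POSIX system; the names PARENT_SRC, pid_alive and orphan_survives_sigkill are made up for the illustration. It shows that a process killed with SIGKILL gets no chance to run any cleanup code, so a child it started keeps running, whether or not the kill was intentional. Whether Ray's memory monitor actually uses SIGKILL is not established in this thread.

```python
import os
import signal
import subprocess
import sys
import time

# Source of a stand-in "worker": it starts a long-running child, reports the
# child's PID on stdout, then idles until it is killed.
PARENT_SRC = """
import subprocess, sys, time
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    stdout=subprocess.DEVNULL,
)
print(child.pid, flush=True)
time.sleep(60)
"""


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def orphan_survives_sigkill() -> bool:
    """SIGKILL the stand-in worker and report whether its child outlived it."""
    worker = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC], stdout=subprocess.PIPE, text=True
    )
    child_pid = int(worker.stdout.readline())
    worker.kill()  # sends SIGKILL on POSIX; the worker cannot intercept it
    worker.wait()
    worker.stdout.close()
    time.sleep(0.5)
    survived = pid_alive(child_pid)
    if survived:
        os.kill(child_pid, signal.SIGKILL)  # clean up the orphan ourselves
    return survived


if __name__ == "__main__":
    print("child outlived SIGKILLed parent:", orphan_survives_sigkill())
```

From the operating system's point of view there is no difference between an "intentional" SIGKILL and a crash: in both cases the killed process cannot clean up its children, which is consistent with the documentation's comment that the worker's own subprocess killing no longer works in that situation.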

Update: for what it’s worth, in my case even setting RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true does not kill the process launched with subprocess.run when the worker is killed due to an out-of-memory error.

Hi @albertcthomas, thanks for the question. Is it possible for you to provide a repro script so that we can investigate the unexpected behavior?

Thanks a lot for the reply. I tried setting RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true on the head node AND the worker nodes, and it seems to solve the issue: I no longer have an orphaned process. Is it intended that this variable must be set on all the nodes? It is still unclear to me why I need this environment variable in my case and why the default one is not enough.

I will try to provide a reproducible script. It is also possible that I am not doing things as I should in the first place, so this could be useful.

I gave details on what I am trying to achieve in the thread “Calling an application that relies on ray inside a remote function”. Independently of my use case, I would be happy to get clarification on the behavior of user-spawned processes and on the questions I asked above :). Thanks again for the help.

I would be very happy if someone could clarify the behavior of user-spawned processes and the corresponding environment variables:

  1. Should RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true be set on all the nodes (the head node and every worker node), or at least on every node where I might need it, and not only on the head node?
  2. When Ray kills a worker because the memory threshold is exceeded, is the worker sigkill’ed and thus considered crashed, so that I have to use RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true if I create a child process with subprocess.run?
  3. Why isn’t a process created with subprocess.run a direct child process, and therefore why isn’t RAY_kill_child_processes_on_worker_exit (default true) enough?

Thanks a lot!