Detached actor method call fails when the driver exits

Say I have a local script, run.py, that looks like this:

from time import sleep
from uuid import uuid4
import ray

ray.init(...) # connect to my ray cluster, including all the runtime_env setups

@ray.remote
class LongTaskRunner:
    def run_long_task(self) -> None:
        print("Running long task")
        for i in range(120):
            print(f"Step {i + 1}/120")
            sleep(1)
        print("Task completed")


name = uuid4().hex
runner = LongTaskRunner.options(lifetime="detached", name=name).remote()
runner.run_long_task.remote()

sleep(5)

Then in my terminal I run:

python run.py

This launches the actor in detached mode.

Although the actor stays alive after the driver exits, the call to the actor’s run_long_task fails. In the dashboard, the error stack trace says:

Error Type: WORKER_DIED

Job finishes (10000000) as driver exits. Marking all non-terminal tasks as failed.

How do I make sure the actor method doesn’t get terminated?

Can’t reproduce. I am working in a Ray cluster. What I did:

  1. use your script to create a detached actor
  2. use this script to get it and invoke another method
import ray

# namespace from the warning log
ray.init(namespace="f068f1e1-88cc-4dc5-9a1b-d1ab6ef6fe3a")

# name from Dashboard
a = ray.get_actor("e199350023f943b5a8cb696f8157f5aa")
ray.get(a.run_long_task.remote())

and it worked. In the dashboard I can see logs of both tasks from the two scripts, and once the task from this second script finished, ray.get returned normally.

Can you make a repro?

What job submission interface are you using? I’m using Ray’s Python client, not ray submit.

Can you share your setup in a bit more detail? I created my local cluster via ray.init().

The script I shared is mostly complete. The only thing I hid is the address and other configuration in ray.init. I have a Ray cluster running remotely, and ray.init just points to that cluster and sets up the appropriate runtime environment.

In your experiment, you need to remove ray.get from ray.get(a.run_long_task.remote()). The whole idea is to dispatch the task without waiting for it to complete, and to have the detached actor keep doing the job even after the driver exits.

I suspect that if you are using a local Ray instance, it dies with the driver anyway, so a detached actor wouldn’t have any effect.
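
For the fire-and-forget goal described above, one possible approach is to have the actor start the long work in its own background thread, so that the method call returns immediately. The driver can then wait on that short call and exit with no task still in flight. This is only a sketch and has not been verified against the remote cluster in this thread; whether it avoids the WORKER_DIED error there is an assumption. The names start_long_task, progress, is_done, and launch_detached, and the steps and delay parameters, are made up for illustration.

```python
import threading
from time import sleep


class LongTaskRunner:
    """Plain class; wrap it with ray.remote(...) to use it as an actor."""

    def __init__(self) -> None:
        self._thread = None
        self._steps_done = 0
        self._done = False

    def _work(self, steps: int, delay: float) -> None:
        # Runs in a background thread inside the actor process.
        for i in range(steps):
            print(f"Step {i + 1}/{steps}")
            sleep(delay)
            self._steps_done = i + 1
        self._done = True
        print("Task completed")

    def start_long_task(self, steps: int = 120, delay: float = 1.0) -> bool:
        """Start the work in the background and return right away."""
        if self._thread is not None and self._thread.is_alive():
            return False  # a run is already in progress
        self._steps_done = 0
        self._done = False
        self._thread = threading.Thread(
            target=self._work, args=(steps, delay), daemon=True
        )
        self._thread.start()
        return True

    def progress(self) -> int:
        """Number of steps finished so far."""
        return self._steps_done

    def is_done(self) -> bool:
        return self._done


def launch_detached(name: str):
    """Hypothetical helper; assumes ray.init(...) has already been called."""
    import ray

    actor_cls = ray.remote(LongTaskRunner)
    runner = actor_cls.options(lifetime="detached", name=name).remote()
    # This call only starts the thread, so it finishes quickly and the
    # driver has no pending actor task when it exits.
    ray.get(runner.start_long_task.remote())
    return runner
```

A later script could fetch the actor with ray.get_actor(name) in the same namespace and call progress or is_done to check on the work.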