Driver on exit fails detached Actor Method

uduse · July 5, 2024, 2:09am

Say I have a script run.py locally that looks like

from time import sleep
from uuid import uuid4
import ray

ray.init(...) # connect to my ray cluster, including all the runtime_env setups

@ray.remote
class LongTaskRunner:
    def run_long_task(self) -> None:
        print("Running long task")
        for i in range(120):
            print(f"Step {i + 1}/120")
            sleep(1)
        print("Task completed")


name = uuid4().hex
runner = LongTaskRunner.options(lifetime="detached", name=name).remote()
runner.run_long_task.remote()

sleep(5)

Then in my terminal I do

python run.py

Which will launch the actor in detached mode.

Though the Actor will remain alive after the driver exits, the call to Actor’s run_long_task will fail. According to the dashboard, the ERROR STACK TRACE says:

Error Type: WORKER_DIED

Job finishes (10000000) as driver exits. Marking all non-terminal tasks as failed.

How do I make sure the actor method doesn’t get terminated?

Ruiyang_Wang · July 8, 2024, 4:55pm

Can’t reproduce. I am working in a ray cluster:

1. use your script to create a detached actor
2. use this script to get it and invoke another method

import ray

# namespace from the warning log
ray.init(namespace="f068f1e1-88cc-4dc5-9a1b-d1ab6ef6fe3a")

# name from Dashboard
a = ray.get_actor("e199350023f943b5a8cb696f8157f5aa")
ray.get(a.run_long_task.remote())

and it worked. In Dashboard I can see logs of both tasks from the 2 scripts, and after the task from this other script finished it returned from ray.get normally.

Can you make a repro?

uduse · July 9, 2024, 2:26pm

What job submission interface are you using? I’m using Ray’s Python client, not ray submit.

Ruiyang_Wang · July 9, 2024, 8:03pm

Can you share your setup in a bit more detail? I created my local cluster via ray.init()

uduse · July 10, 2024, 2:14pm

The script I shared is mostly complete. The only thing I hid is the address and other configuration in ray.init. I have a Ray cluster running remotely, and the ray.init call just points to that cluster and sets up the appropriate runtime environment.

In your experiment, you need to remove ray.get from ray.get(a.run_long_task.remote()). The whole idea is to dispatch the task without waiting for it to complete, and hope the detached Actor keeps doing the job even after the driver exits.
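
For concreteness, the fire-and-forget variant of the second script might look like the sketch below. It assumes the cluster can be reached with address="auto" (an assumption; the real ray.init arguments are not shown in this thread), and the helper name dispatch_without_waiting is made up for illustration. It only shows the dispatch pattern under discussion and does not by itself guarantee that the call survives the driver exiting.

```python
def dispatch_without_waiting(actor_name, namespace, address="auto"):
    """Look up a detached actor by name and start run_long_task without blocking."""
    # Imported inside the function so that defining this helper does not
    # require Ray to be importable; calling it does.
    import ray

    # address="auto" is an assumption: it attaches to an already running cluster.
    ray.init(address=address, namespace=namespace)

    # Named actors are looked up within the namespace passed to ray.init.
    runner = ray.get_actor(actor_name)

    # .remote() returns an ObjectRef immediately; there is no ray.get(), so
    # nothing here waits for the 120-step loop to finish.
    return runner.run_long_task.remote()
```

Calling dispatch_without_waiting(name, namespace) and then letting the script end is the situation in question: the driver exits while the method is still running on the actor.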

I suspect that if you are using a local Ray instance, Ray dies with the driver anyway, so the detached Actor won’t be effective.

Vibrat · December 28, 2024, 6:29am

I experienced the same issue, and it turned out that ray job submit will cancel tasks if the submitting process exits before all tasks are completed. Adding --no-wait solves the issue.

ray job submit --no-wait --address http://localhost:8265 --working-dir . -- python remote3.py
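
A rough Python counterpart to the command above, as a sketch: Ray's job submission SDK (JobSubmissionClient in ray.job_submission) submits a job and returns its submission ID without following the logs. The helper name submit_without_waiting is hypothetical, and the address, working directory, and entrypoint are simply copied from the command above.

```python
def submit_without_waiting(entrypoint, address="http://localhost:8265", working_dir="."):
    """Submit a job via Ray's Jobs API and return its submission ID right away."""
    # Imported inside the function so that defining this helper does not
    # require Ray to be importable; calling it does.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    # submit_job returns the submission ID without waiting for the
    # entrypoint to finish and without streaming its logs.
    return client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": working_dir},
    )
```

submit_without_waiting("python remote3.py") returns an ID that can later be passed to JobSubmissionClient.get_job_status, or to ray job status on the command line, to check on the job.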
