Say I have a script run.py locally that looks like
```python
from time import sleep
from uuid import uuid4

import ray

ray.init(...)  # connect to my ray cluster, including all the runtime_env setups


@ray.remote
class LongTaskRunner:
    def run_long_task(self) -> None:
        print("Running long task")
        for i in range(120):
            print(f"Step {i + 1}/120")
            sleep(1)
        print("Task completed")


name = uuid4().hex
runner = LongTaskRunner.options(lifetime="detached", name=name).remote()
runner.run_long_task.remote()
sleep(5)
```
Then in my terminal I run

```shell
python run.py
```

which launches the actor in detached mode.
Though the actor remains alive after the driver exits, the call to the actor's run_long_task fails. According to the dashboard, the error stack trace says:
```
Error Type: WORKER_DIED
Job finishes (10000000) as driver exits. Marking all non-terminal tasks as failed.
```
How do I make sure the actor method doesn't get terminated?
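One workaround I'm aware of (a sketch, not verified against this exact setup): have the actor own the long-running work itself, by starting it on a background thread from a method that returns immediately. The driver's submitted task then finishes before the driver exits, while the work keeps running inside the detached actor's process. The pattern is shown below in plain Python so it runs without a cluster; `start_long_task` is a hypothetical method name, and the `@ray.remote` decoration it would need is indicated in a comment.

```python
import threading
from time import sleep


# In the real script this class would be decorated with @ray.remote and
# created via .options(lifetime="detached", name=...). It is a plain class
# here only so the sketch runs without a Ray cluster.
class LongTaskRunner:
    def __init__(self) -> None:
        self._thread = None
        self.steps_done = 0

    def _run_long_task(self) -> None:
        # Stand-in for the 120 one-second steps in the original script.
        for _ in range(3):
            self.steps_done += 1
            sleep(0.01)

    def start_long_task(self) -> None:
        # Returns immediately: the driver's call completes before the driver
        # exits, and the thread keeps running inside the actor's process.
        self._thread = threading.Thread(target=self._run_long_task, daemon=True)
        self._thread.start()


runner = LongTaskRunner()
runner.start_long_task()  # with Ray: runner.start_long_task.remote()
runner._thread.join()     # only for this local demo; the actor would stay alive
print(runner.steps_done)  # → 3
```

The key design point is that the long-running work is no longer a Ray task owned by the exiting driver, so there is nothing for the job cleanup to mark as failed.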
I used this script to get the actor and invoke the method again:
```python
import ray

# namespace from the warning log
ray.init(namespace="f068f1e1-88cc-4dc5-9a1b-d1ab6ef6fe3a")

# name from Dashboard
a = ray.get_actor("e199350023f943b5a8cb696f8157f5aa")
ray.get(a.run_long_task.remote())
```
and it worked. In the Dashboard I can see the logs of both tasks from the two scripts, and after the task from this second script finished, ray.get returned normally.
The script I shared is mostly complete. The only thing I hid is the address and other configuration in ray.init. I have a Ray cluster running remotely, and ray.init just points to that cluster and sets up the appropriate runtime environment.
In your experiment, you need to remove the ray.get from ray.get(a.run_long_task.remote()). The whole idea is to dispatch the task without waiting for it to complete, and hope the detached actor keeps doing the job even after the driver exits.
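The fire-and-forget idea can be illustrated with a plain-Python analogy (an assumption-free stand-in, since no cluster is available here): calling `.remote()` is like submitting work to an executor and keeping the future without calling `.result()`, so the caller never blocks.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def work() -> str:
    sleep(0.05)  # stand-in for the long-running actor method
    return "done"


ex = ThreadPoolExecutor(max_workers=1)
fut = ex.submit(work)      # like a.run_long_task.remote(): returns immediately
print(fut.done())          # typically False right after dispatch
# Not calling fut.result() here is analogous to dropping ray.get(...).
print(fut.result())        # only for this demo; the forum script omits the wait
```

The open question in the thread is whether Ray honors this fire-and-forget semantics for a detached actor once the submitting driver is gone, which is exactly what the WORKER_DIED error suggests it does not.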
I suspect that if you are using a local Ray instance, it dies with the driver anyway, so a detached actor won't be effective.