Detached Actor: Correct Definition and Declaration (can't reproduce consistently)

I’m trying to create a detached actor so that I can use it from another driver script. This is still local testing. The code below works inconsistently (it fails sometimes), and I don’t know how to reproduce a consistent success or failure.

I expected the following code to work:

# tf1/main.py
import tensorflow as tf
import time
DEPLOY_TIME = time.time()
class Predictor:
    def __init__(self):
        pass

    def work(self):
        return tf.__version__ + f"|{DEPLOY_TIME}"



# ray_url = "ray://localhost:10002"

if __name__ == "__main__":
    print("Deploy Time:" + str(DEPLOY_TIME))

    import ray
    with ray.init(namespace='indexing'):
        try:
            old = ray.get_actor("tf1")
            print("Killing TF1")
            ray.kill(old)
        except ValueError:
            print("Not Killing TF1 as it's not present")


        PredictorActor = ray.remote(Predictor)
        PredictorActor.options(name="tf1", lifetime="detached").remote()


If I add the following three lines at the end, it works consistently.

        a = ray.get_actor("tf1")
        print("Named Actor Call")
        print(ray.get(a.work.remote()))

I’m then calling the actor from another driver script:

# indexing/main.py
import ray

ray.init(namespace="indexing")
print("Ray Namespace")
print(ray.get_runtime_context().namespace)

print("In Pipeline Indexing Both")
a = ray.get_actor("tf1")
print(ray.get(a.work.remote()))

a = ray.get_actor("tf2")
print(ray.get(a.work.remote()))

My run script:

# indexing/run.sh
cd /home/rajiv/Documents/dev/bht/wdml/steps/tf1 &&
source ./venv/bin/activate &&
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["tensorflow==1.15"], "excludes": ["venv"]}' -- python main.py     &&
cd /home/rajiv/Documents/dev/bht/wdml/pipelines/indexing &&
source /home/rajiv/venvs/indexing/bin/activate &&
ray job submit --runtime-env-json='{"working_dir": "./", "pip": []}' -- python main.py

The error I get is:

Traceback (most recent call last):
  File "main.py", line 10, in <module>
    a = ray.get_actor("tf1")
  File "/home/rajiv/venvs/tf2/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/rajiv/venvs/tf2/lib/python3.7/site-packages/ray/worker.py", line 2031, in get_actor
    return worker.core_worker.get_named_actor_handle(name, namespace or "")
  File "python/ray/_raylet.pyx", line 1875, in ray._raylet.CoreWorker.get_named_actor_handle
  File "python/ray/_raylet.pyx", line 171, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'tf1'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.

Details:

  • Ray 1.12 Default
  • All code is submitted via the ray job api

I think this is probably a race condition: your first script may be exiting before the actor has been successfully created, because the .remote() call is asynchronous. Calling ray.get() on an actor method forces the script to block until the actor has been created successfully.

The workaround is to always call ray.get() on an actor method, to ensure the actor is up before the launch script exits.
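On the consumer side, another defensive option is to poll for the named actor instead of failing on the first lookup. Here is a minimal sketch; the `wait_until_registered` helper is hypothetical (not part of Ray's API), and in the second driver you would pass `ray.get_actor` as the `lookup` argument:

```python
import time


def wait_until_registered(lookup, name, timeout_s=30.0, interval_s=0.5):
    """Poll `lookup(name)` until it succeeds or the timeout elapses.

    `lookup` is any callable that raises ValueError while the actor is
    not yet visible (e.g. `ray.get_actor`). Re-raises the last
    ValueError once the deadline is reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return lookup(name)
        except ValueError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)


# In the consumer driver you would then write something like:
#   a = wait_until_registered(ray.get_actor, "tf1")
#   print(ray.get(a.work.remote()))
```

This only papers over the race in the consumer; the real fix is still to block in the launch script (or for the job to wait for actor registration before exiting), as discussed above.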

cc @Chen_Shen @yic I think we should wait briefly for actors to register successfully before the job exits.


Thanks for reporting, we created an issue here: [Core] race condition between job exits and actor creation. · Issue #24890 · ray-project/ray · GitHub

Thanks for your quick response. Re: the workaround, sounds good.