Detached Actor: Correct Definition and Declaration (can't reproduce consistently)

I’m trying to create a detached actor so that I can use it from another driver script. This is still local testing. The code below works inconsistently (it fails sometimes), and I don’t know how to reproduce a consistent success or failure.

I expected the following code to work:

# tf1/main.py
import tensorflow as tf
import time
DEPLOY_TIME = time.time()
class Predictor:
    def __init__(self):
        pass

    def work(self):
        return tf.__version__ + f"|{DEPLOY_TIME}"



# ray_url = "ray://localhost:10002"

if __name__ == "__main__":
    print("Deploy Time:" + str(DEPLOY_TIME))

    import ray
    with ray.init(namespace='indexing'):
        try:
            old = ray.get_actor("tf1")
            print("Killing TF1")
            ray.kill(old)
        except ValueError:
            print("Not Killing TF1 as it's not present")


        PredictorActor = ray.remote(Predictor)
        PredictorActor.options(name="tf1", lifetime="detached").remote()


If I add the following three lines at the end, it works consistently.

        a = ray.get_actor("tf1")
        print("Named Actor Call")
        print(ray.get(a.work.remote()))

I’m then calling the actor from another driver script:

# indexing/main.py
import ray

ray.init(namespace="indexing")
print("Ray Namespace")
print(ray.get_runtime_context().namespace)

print("In Pipeline Indexing Both")
a = ray.get_actor("tf1")
print(ray.get(a.work.remote()))

a = ray.get_actor("tf2")
print(ray.get(a.work.remote()))

My run script:

# indexing/run.sh
cd /home/rajiv/Documents/dev/bht/wdml/steps/tf1 &&
source ./venv/bin/activate &&
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["tensorflow==1.15"], "excludes": ["venv"]}' -- python main.py     &&
cd /home/rajiv/Documents/dev/bht/wdml/pipelines/indexing &&
source /home/rajiv/venvs/indexing/bin/activate &&
ray job submit --runtime-env-json='{"working_dir": "./", "pip": []}' -- python main.py

The error I get is:

Traceback (most recent call last):
  File "main.py", line 10, in <module>
    a = ray.get_actor("tf1")
  File "/home/rajiv/venvs/tf2/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/rajiv/venvs/tf2/lib/python3.7/site-packages/ray/worker.py", line 2031, in get_actor
    return worker.core_worker.get_named_actor_handle(name, namespace or "")
  File "python/ray/_raylet.pyx", line 1875, in ray._raylet.CoreWorker.get_named_actor_handle
  File "python/ray/_raylet.pyx", line 171, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'tf1'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.

Details:

  • Ray 1.12 Default
  • All code is submitted via the ray job api

I think this is probably a race condition: your first script may be exiting before the actor has been successfully created, because the .remote() call is asynchronous. Calling ray.get() on an actor method forces the script to block until the actor has been created successfully.

The workaround is to always call ray.get() on an actor method, to ensure the actor is up before the launch script exits.
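On the consumer side, another defensive option is to poll for the named actor instead of failing on the first lookup. Here is a minimal sketch; the `wait_until_registered` helper is hypothetical (not part of Ray's API), and in the second driver you would pass `ray.get_actor` as the `lookup` argument:

```python
import time


def wait_until_registered(lookup, name, timeout_s=30.0, interval_s=0.5):
    """Poll `lookup(name)` until it succeeds or the timeout elapses.

    `lookup` is any callable that raises ValueError while the actor is
    not yet visible (e.g. `ray.get_actor`). Re-raises the last
    ValueError once the deadline is reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return lookup(name)
        except ValueError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)


# In the consumer driver you would then write something like:
#   a = wait_until_registered(ray.get_actor, "tf1")
#   print(ray.get(a.work.remote()))
```

This only papers over the race in the consumer; the real fix is still to block in the launch script (or for the job to wait for actor registration before exiting), as discussed above.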

cc @Chen_Shen @yic I think we should wait briefly for actors to register successfully before the job exits.


Thanks for reporting, we created an issue here: [Core] race condition between job exits and actor creation. · Issue #24890 · ray-project/ray · GitHub

Thanks for your quick response. Re: the workaround, sounds good.