Best way to clean up all stale actors?

I am setting up an integration test. As its very first step, I would like to clean up all the stale actors left over from previous runs so that everything starts fresh.

I tried to get all the handles and then ray.kill() them (following the doc here: Using Actors — Ray v2.0.0.dev0). But that requires listing out every possible actor, and if any of the actor names does not exist, I get a “Failed to look up actor with name” exception.

What I am looking for is probably something like a “ray.killall()”. Any suggestions? Thanks a lot.
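
For reference, this is roughly the per-name cleanup I am trying to avoid (a sketch; the hypothetical KNOWN_ACTOR_NAMES list is something I would have to maintain by hand):

    import ray

    # Every named actor any previous run might have created.
    KNOWN_ACTOR_NAMES = ["Gserver", "Gserver2"]

    for name in KNOWN_ACTOR_NAMES:
        try:
            handle = ray.get_actor(name)  # raises ValueError if no actor has this name
        except ValueError:
            continue                      # nothing stale registered under this name
        ray.kill(handle)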

Usually I would run ray.shutdown() to just reset the entire setup. Would that work for you in your use case?
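
For a script that starts its own Ray instance, that reset can look roughly like this (a minimal sketch):

    import ray

    ray.shutdown()  # safe to call even if Ray is not running in this process
    ray.init()      # start a fresh local Ray instance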

Thanks Richard!

I tried out shutdown() a bit and also skimmed through this page: ray/starting-ray.rst at 35ec91c4e04c67adc7123aa8461cf50923a316b4 · ray-project/ray · GitHub

So far it does not serve my use case well: I call ray.shutdown() at the beginning, before init(), but I still run into “ValueError: The name Gserver is already taken”, because the stale processes are still there. What did I miss?

Also, according to the inline comments here (ray/worker.py at 9a93dd9682a216d2028db8edb60ff1485f653721 · ray-project/ray · GitHub), shutdown() happens automatically when a Python instance finishes. If that is the case, then why do we want to call it explicitly?

Hmm, can you provide a simple script as an example for me to better understand what you’re trying to do?

Thanks Richard. Here is a simple example. Inside the script, we start two services, and then two clients that connect to these services.

    os.system("ray stop")
    time.sleep(2)
    os.system("ray start --head")

    ray.init("auto")

    svr1 = GraphServer.options(name="Gserver", lifetime="detached").remote(0)
    svr2 = GraphServer.options(name="Gserver2", lifetime="detached").remote(1)
    future1 = svr1.serve.remote()
    future2 = svr2.serve.remote()

    client = Network.remote(args, 0)
    client2 = Network.remote(args, 1)

    ray.get([client.train.remote(), client2.train.remote()])

     ...
    
    ray.shutdown()
    ray.kill(svr1)
    ray.kill(svr2)

Normally, when everything goes well, this is a clean and solid pattern. In my use case, though, this runs in continuous integration, where new code sometimes breaks (say, due to a bug). If either server or either client crashes, the cleanup code at the end never gets a chance to execute, leaving stale detached actors active in the Ray cluster. The same thing happens when using this script during development, with repeated crashing and debugging.
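
One partial mitigation is to wrap the blocking call in try/finally so the kill calls still run when train() raises; a sketch reusing the handles from the example above, which still does not help if the driver process itself is killed:

    try:
        ray.get([client.train.remote(), client2.train.remote()])
    finally:
        # Runs when train() raises in the driver, but not if the driver process dies outright.
        ray.kill(svr1)
        ray.kill(svr2)
        ray.shutdown()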

Right now I run ray stop and then ray start --head first on every run. I guess this workaround only works in single-machine cluster mode.
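
If a newer Ray version exposes ray.util.list_named_actors() (I am not sure which releases include it), something like this at the very start of the run might be a cluster-friendly replacement for the stop/start workaround; a rough, untested sketch:

    import ray
    from ray.util import list_named_actors

    ray.init(address="auto")

    # Kill every named actor left over from previous runs, regardless of which node it lives on.
    for name in list_named_actors():
        try:
            ray.kill(ray.get_actor(name))
        except ValueError:
            pass  # the actor disappeared between the listing and the lookup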