Best way to clean up all stale actors?

I am setting up an integration test. As its very first step, I would like to clean up all the stale actors left over from previous runs so that everything starts fresh.

I tried to get all the handles and then ray.kill() them (following the doc here: Using Actors — Ray v2.0.0.dev0). But that requires listing out every possible actor, and if any of the actor names does not exist, I get a “Failed to look up actor with name” exception.

What I am looking for is probably something like a “ray.killall()”. Any suggestions? Thanks a lot.
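
For reference, this is roughly the per-name cleanup I am trying to avoid (a sketch; the hypothetical KNOWN_ACTOR_NAMES list is something I would have to maintain by hand):

    import ray

    # Every named actor any previous run might have created.
    KNOWN_ACTOR_NAMES = ["Gserver", "Gserver2"]

    for name in KNOWN_ACTOR_NAMES:
        try:
            handle = ray.get_actor(name)  # raises ValueError if no actor has this name
        except ValueError:
            continue                      # nothing stale registered under this name
        ray.kill(handle)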

Usually I would run ray.shutdown() to just reset the entire setup. Would that work for you in your use case?
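
For a script that starts its own Ray instance, that reset can look roughly like this (a minimal sketch):

    import ray

    ray.shutdown()  # safe to call even if Ray is not running in this process
    ray.init()      # start a fresh local Ray instance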

Thanks Richard!

I tried out shutdown() a bit and also skimmed through this page: ray/starting-ray.rst at 35ec91c4e04c67adc7123aa8461cf50923a316b4 · ray-project/ray · GitHub

So far it does not serve my use case well: I call ray.shutdown() at the beginning, before init(), but I still run into “ValueError: The name Gserver is already taken”, because the stale processes are still there. What did I miss?

Also, according to the inline comments here (ray/worker.py at 9a93dd9682a216d2028db8edb60ff1485f653721 · ray-project/ray · GitHub), shutdown() happens automatically when a Python instance finishes. If that is the case, then why do we want to call it explicitly?

Hmm, can you provide a simple script as an example for me to better understand what you’re trying to do?

Thanks Richard. Here is a simple example. Inside the script, we start two services, and then two clients that connect to these services.

    os.system("ray stop")
    time.sleep(2)
    os.system("ray start --head")

    ray.init("auto")

    svr1 = GraphServer.options(name="Gserver", lifetime="detached").remote(0)
    svr2 = GraphServer.options(name="Gserver2", lifetime="detached").remote(1)
    future1 = svr1.serve.remote()
    future2 = svr2.serve.remote()

    client = Network.remote(args, 0)
    client2 = Network.remote(args, 1)

    ray.get([client.train.remote(), client2.train.remote()])

     ...
    
    ray.shutdown()
    ray.kill(svr1)
    ray.kill(svr2)

Normally, when everything goes well, this is a clean and solid pattern. In my use case, though, this runs in continuous integration, where new code sometimes breaks (say, due to a bug). If either server or either client crashes, the cleanup code at the end never gets a chance to execute, leaving stale detached actors active in the Ray cluster. The same thing happens when using this script during development, with repeated crashing and debugging.
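
One partial mitigation is to wrap the blocking call in try/finally so the kill calls still run when train() raises; a sketch reusing the handles from the example above, which still does not help if the driver process itself is killed:

    try:
        ray.get([client.train.remote(), client2.train.remote()])
    finally:
        # Runs when train() raises in the driver, but not if the driver process dies outright.
        ray.kill(svr1)
        ray.kill(svr2)
        ray.shutdown()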

Right now I run ray stop and then ray start --head first on every run. I guess this workaround only works in single-machine cluster mode.
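
If a newer Ray version exposes ray.util.list_named_actors() (I am not sure which releases include it), something like this at the very start of the run might be a cluster-friendly replacement for the stop/start workaround; a rough, untested sketch:

    import ray
    from ray.util import list_named_actors

    ray.init(address="auto")

    # Kill every named actor left over from previous runs, regardless of which node it lives on.
    for name in list_named_actors():
        try:
            ray.kill(ray.get_actor(name))
        except ValueError:
            pass  # the actor disappeared between the listing and the lookup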