I am setting up an integration test. As the very first step of this integration test, I would like to clean up all the stale actors (from previous runs) so that I can start everything fresh.
I tried getting all the actor handles and then calling ray.kill() on each of them (following the doc here: Using Actors — Ray v2.0.0.dev0). But that requires listing every possible actor name, and if any of those names does not exist, the lookup raises a “Failed to look up actor with name” exception.
What I am looking for is probably something like a “ray.killall()” I think. Any suggestions? Thanks a lot.
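A minimal sketch of what such a “kill all” helper could look like. The function name kill_stale_actors and its parameters are hypothetical, not part of Ray. It assumes that ray.get_actor() raises ValueError for an unknown name (which matches the “Failed to look up actor with name” error above) and that the installed Ray version provides ray.util.list_named_actors(); if it does not, pass the names explicitly.

```python
from typing import Callable, Iterable, List, Optional


def _ray():
    # Imported lazily so the helper can also be exercised with fakes.
    import ray
    return ray


def kill_stale_actors(
    names: Optional[Iterable[str]] = None,
    list_names: Optional[Callable[[], Iterable[str]]] = None,
    get_actor: Optional[Callable[[str], object]] = None,
    kill: Optional[Callable[[object], None]] = None,
) -> List[str]:
    """Kill the named actors that exist and silently skip the ones that do not.

    Hypothetical helper: if `names` is None, the names are discovered with
    ray.util.list_named_actors() (assumed to be available in your Ray version,
    and assumed to return names from the current namespace).
    """
    if names is None:
        names = (list_names or _ray().util.list_named_actors)()
    get_actor = get_actor or _ray().get_actor
    kill = kill or _ray().kill

    killed = []
    for name in names:
        try:
            handle = get_actor(name)
        except ValueError:
            # No actor with this name (or it died in the meantime): nothing to do.
            continue
        kill(handle)
        killed.append(name)
    return killed
```

Called right after ray.init() has connected to the cluster, for example kill_stale_actors(["Gserver", "server1", "server2"]), this would remove whatever is left over from a previous run without failing on names that were never created.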
So far it does not serve my use case well: I called ray.shutdown() at the beginning, before init(), but I still ran into “ValueError: The name Gserver is already taken”, because the stale processes are still there. What did I miss?
Normally, when everything goes well, this is a clean and solid paradigm. In my use case, though, during a continuous integration run, new code can break (say, due to a bug). If server1 or server2 (or the client) crashes for some reason, the cleanup code at the end never gets a chance to execute, leaving stale actors active in the Ray cluster. The same situation occurs when one uses this script in dev mode, repeatedly crashing and debugging.
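One way to make the end-of-run cleanup more robust is to wrap the test body so the cleanup also runs when the body raises. The sketch below uses a plain context manager; cleanup_on_exit is a hypothetical name, and the cleanup callable is whatever kills your actors. Note the limit: this only covers Python exceptions. If the process is killed outright, the cleanup still does not run, so cleaning up at the start of the next run remains necessary.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def cleanup_on_exit(cleanup: Callable[[], None]) -> Iterator[None]:
    """Run `cleanup` when the with-block exits, whether or not it raised."""
    try:
        yield
    finally:
        # Executes on normal exit and on exceptions, before the
        # exception (if any) propagates to the caller.
        cleanup()
```

In a test this could be used as "with cleanup_on_exit(kill_my_actors): run_test()", where kill_my_actors and run_test are placeholders for your own functions.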
For now I run ray stop and then ray start first, every time the test runs. I guess this workaround only works in single-machine cluster mode.
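For reference, the stop-then-start workaround could be scripted roughly like this. It is only a sketch: restart_local_ray is a hypothetical helper, and it assumes the ray CLI is on the PATH and that a single head node started with "ray start --head" is what the test needs. It only restarts Ray on the local machine.

```python
import subprocess
from typing import Callable, List, Optional


def restart_local_ray(
    run: Optional[Callable[[List[str]], object]] = None,
) -> List[List[str]]:
    """Stop the local Ray processes, then start a fresh head node.

    `run` executes one command; by default it shells out via subprocess
    and raises if the command exits with a non-zero status.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)

    commands = [["ray", "stop"], ["ray", "start", "--head"]]
    for cmd in commands:
        run(cmd)
    return commands
```

Because every actor dies with the cluster, this guarantees a clean slate, at the cost of a slower test start and of not being applicable to a shared multi-node cluster.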