Thanks, Richard. Here is a simple example. Inside the script, we start two services, and then the clients that connect to these services:
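import os

# GraphServer and Network are @ray.remote actor classes defined elsewhere; args is the client config.
# Start a local head node for this run.
os.system("ray start --head")
# Detached named actors outlive the driver script that created them.
svr1 = GraphServer.options(name="Gserver", lifetime="detached").remote(0)
svr2 = GraphServer.options(name="Gserver2", lifetime="detached").remote(1)
# Kick off both serve loops; the futures stay pending while the servers run.
future1 = svr1.serve.remote()
future2 = svr2.serve.remote()
# One client per server index.
client = Network.remote(args, 0)
client2 = Network.remote(args, 1)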
Normally, when everything goes well, this is a clean and solid paradigm. In my use case, though, the script runs in continuous integration, where new code sometimes breaks (say, due to a bug). If server1, server2, or a client crashes for some reason, the cleanup code at the end never gets a chance to execute, leaving stale actors active in the Ray cluster. The same situation occurs when using this script in dev mode to reproduce crashes and debug them.
For now, I run ray stop and then ray start at the beginning of every run. I guess this workaround only works in single-machine cluster mode.
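
One alternative I am considering, instead of restarting the whole cluster, is to kill any leftover named actors at the top of the script. Below is a minimal sketch, not a tested solution. It assumes ray.get_actor raises ValueError when no actor with that name exists, and that the driver is already connected to the cluster in the same namespace the detached actors were created in. The helper name kill_stale_actors and its get_actor/kill parameters are my own invention; the parameters exist only so the logic can be tried without a cluster.

```python
def kill_stale_actors(names, get_actor=None, kill=None):
    """Kill any still-registered named actors and return the names killed.

    get_actor / kill default to ray.get_actor / ray.kill. They are
    hypothetical hooks, here only so the helper can run without a cluster.
    """
    if get_actor is None or kill is None:
        # Imported lazily; assumes the driver is already connected to Ray.
        import ray
        get_actor = get_actor or ray.get_actor
        kill = kill or ray.kill

    killed = []
    for name in names:
        try:
            handle = get_actor(name)
        except ValueError:
            # No actor registered under this name, so nothing to clean up.
            continue
        kill(handle)
        killed.append(name)
    return killed
```

In the script above, this would be called as kill_stale_actors(["Gserver", "Gserver2"]) before the two GraphServer actors are created. Since it goes through the cluster's actor registry rather than restarting a local node, it should in principle not be limited to a single machine, though I have not verified that.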