How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Hello,
I use Ray to serve Machine Learning models and I have some questions about the behaviour of “serve.shutdown()”.
To reconnect to a remote Ray Serve, I usually call the sequence “ray.shutdown() + serve.shutdown()”.
I am not really trying to shut down the remote Ray Serve. I call serve.shutdown() to clear the internal state and avoid the error “Can’t run an actor the server doesn’t have a handle for” (see this).
In version 1.12.1, this works fine. In fact, I can call serve.shutdown() many times after ray.shutdown() and the result is always the same: no exceptions, and the Serve internal state is cleared (it sets _global_client = None).
Looking at the code, the function serve.shutdown() checks if _global_client is None before doing anything else.
In version 1.13.0, the function serve.shutdown() now calls the new function get_global_client(), which sometimes creates a new Ray cluster and connects to it. This change may cause some errors in my application.
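To make the difference concrete, here is a toy model of the two behaviours as I understand them. It is not Ray's source code: the class ToyServeState and its methods are hypothetical names, and the model always "starts a cluster" when no client is cached, whereas the real 1.13.0 only does so sometimes.

```python
class ToyServeState:
    """Toy stand-in for Serve's module-level client state (not Ray's code)."""

    def __init__(self):
        self.global_client = None    # plays the role of _global_client
        self.clusters_started = 0    # counts simulated local cluster starts

    def shutdown_like_1_12(self):
        # Behaviour described for 1.12.1: return early if no client is cached.
        if self.global_client is None:
            return
        self.global_client = None

    def get_global_client(self):
        # Behaviour described for 1.13.0: the lookup itself may connect,
        # which in this toy model always starts a new local cluster.
        if self.global_client is None:
            self.clusters_started += 1
            self.global_client = object()
        return self.global_client

    def shutdown_like_1_13(self):
        self.get_global_client()     # the side effect happens here
        self.global_client = None


state = ToyServeState()
state.shutdown_like_1_12()
print(state.clusters_started)  # 0: nothing was started
state.shutdown_like_1_13()
print(state.clusters_started)  # 1: the lookup started a cluster
```

In the toy model, the 1.12.1-style call is a no-op when nothing is cached, while the 1.13.0-style call has a side effect before it clears the state.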
A workaround is to call “ray.shutdown() + serve.context.set_global_client(None)”. This way, I am able to reconnect to the remote Ray cluster and use Ray Serve.
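The workaround can be wrapped in a small helper, sketched below. The function name reconnect and its optional ray_module / serve_context parameters are my own (they only exist so the helper can be tried with stand-in objects), and set_global_client is, as far as I can tell, not a public API, so this may break in other versions.

```python
import importlib


def reconnect(address, ray_module=None, serve_context=None):
    """Drop the local Ray and Serve client state, then reconnect to `address`."""
    # Import lazily so the helper can also be exercised with stand-ins.
    ray = ray_module or importlib.import_module("ray")
    ctx = serve_context or importlib.import_module("ray.serve.context")

    ray.shutdown()                # disconnect this process from the cluster
    ctx.set_global_client(None)   # non-public: forget the cached Serve client
    return ray.init(address=address)
```

Usage would be something like reconnect("ray://172.17.0.2:10001"), after which serve.list_deployments() should talk to the remote Serve again.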
So here are my questions:
1. Should “serve.shutdown()” in version 1.13.0 create new clusters? Is this the intended behaviour?
2. What is the recommended way to “clean the internal state” so that I can reconnect to the remote Ray Serve?
   - I think “serve.context.set_global_client()” is not a public API.
   - I suppose “serve.shutdown()” should be used if I want to shut down the remote server.
   - Could we have a new function to do this “cleaning”?
3. If you set the environment variable RAY_ADDRESS=ray://YOURHOST:10001 and call “ray.shutdown() + serve.shutdown()”, this will shut down your remote Ray Serve. Is this the intended behaviour?
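For the RAY_ADDRESS case, a defensive check like the one below could be run before calling serve.shutdown(). This is only a sketch: the helper name ray_address_is_remote is hypothetical, and it only inspects the environment variable, not what Ray actually does with it.

```python
import os


def ray_address_is_remote(env=None):
    """Return True when RAY_ADDRESS names a Ray Client endpoint (ray://...)."""
    env = os.environ if env is None else env
    return env.get("RAY_ADDRESS", "").startswith("ray://")
```

If this returns True, I would avoid calling serve.shutdown() just to clean local state, since it may shut down the remote Ray Serve instead.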
See below an example of errors using “serve.shutdown()”.
Thanks in advance!
Example of errors using “serve.shutdown()”.
(environment variable RAY_ADDRESS is not set)
In [1]: import ray
In [2]: from ray import serve
In [3]: ray.init(address="ray://172.17.0.2:10001")
Out[3]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)
In [4]: serve.list_deployments()
Out[4]: {}
In [5]: ray.shutdown()
In [6]: serve.shutdown()
In [7]: ray.is_initialized()
Out[7]: False
In [8]: ray.init(address="ray://172.17.0.2:10001")
Out[8]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)
### "serve.list_deployments() is not called now"
In [9]: ray.shutdown()
In [10]: serve.shutdown()
2022-07-10 02:41:05,598 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
### "Local cluster was created"
In [11]: ray.is_initialized()
Out[11]: True
In [12]: ray.init(address="ray://172.17.0.2:10001")
Out[12]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)
In [13]: serve.list_deployments()
Out[13]: {}
In [14]: ray.shutdown()
In [15]: serve.shutdown()
Exception: Ray Client is not connected. Please connect by calling `ray.init`.
In [16]: ray.is_initialized()
Out[16]: True
In [17]: ray.init(address="ray://172.17.0.2:10001")
Out[17]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)
In [18]: serve.list_deployments()
Caught schedule exception
Exception: Can't run an actor the server doesn't have a handle for