Serve.shutdown() and how to reconnect to cluster

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello,

I use Ray to serve Machine Learning models and I have some questions about the behaviour of “serve.shutdown()”.

To reconnect to a remote Ray Serve instance, I usually call the sequence “ray.shutdown() + serve.shutdown()”.
I am not actually trying to shut down the remote Ray Serve instance. I call serve.shutdown() to clean the internal state and avoid the error “Can’t run an actor the server doesn’t have a handle for” (see this).

In version 1.12.1, this works fine. In fact, I can call serve.shutdown() many times after ray.shutdown() and the result is always the same: no exceptions, and Serve’s internal state is cleaned (_global_client is set to None).
Looking at the code, the function serve.shutdown() checks if _global_client is None before doing anything else.

In version 1.13.0, the function serve.shutdown() now calls the new function get_global_client(), which sometimes creates a new Ray cluster and connects to it. This change may cause some errors in my application.

A workaround is to call “ray.shutdown() + serve.context.set_global_client(None)”. This way, I am able to reconnect to the remote Ray cluster and use Ray Serve.
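The difference between the two versions, as described above, can be sketched with a small toy model. This is not Ray's source code: FakeRay, ServeModel, shutdown_v1_12, and shutdown_v1_13 are hypothetical names invented for illustration, and the real functions do more work (for example, actually shutting down a running Serve instance). The sketch only models the part discussed here: whether a cached client is looked up in a way that can start a local cluster as a side effect.

```python
class FakeRay:
    """Hypothetical stand-in for the Ray runtime; tracks connection state only."""

    def __init__(self):
        self.initialized = False
        self.local_clusters_started = 0

    def init_local(self):
        # Models an implicit ray.init() that starts a local cluster.
        self.initialized = True
        self.local_clusters_started += 1

    def shutdown(self):
        self.initialized = False


class ServeModel:
    """Hypothetical stand-in for Serve's module-level client cache."""

    def __init__(self, ray):
        self.ray = ray
        self._global_client = None

    def connect(self):
        # Models a Serve API call that caches a client after first use.
        self._global_client = object()

    def shutdown_v1_12(self):
        # Behaviour described for 1.12.1: do nothing if no client is cached.
        if self._global_client is None:
            return
        self._global_client = None

    def _get_global_client(self):
        # Behaviour described for 1.13.0: looking up the client may
        # initialize Ray (here: start a local cluster) when none is cached.
        if self._global_client is None:
            if not self.ray.initialized:
                self.ray.init_local()
            self._global_client = object()
        return self._global_client

    def shutdown_v1_13(self):
        self._get_global_client()
        self._global_client = None

    def set_global_client(self, client):
        # Models the workaround: reset the cache without any lookup.
        self._global_client = client


ray = FakeRay()
serve = ServeModel(ray)

ray.shutdown()
serve.shutdown_v1_12()
print("1.12.1-style shutdown, clusters started:", ray.local_clusters_started)

serve.shutdown_v1_13()
print("1.13.0-style shutdown, clusters started:", ray.local_clusters_started)
```

In this model, the workaround corresponds to set_global_client(None): it clears the cached client directly, so no lookup happens and no cluster is started.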

So here are my questions:

Should serve.shutdown() in version 1.13.0 create new clusters? Is this the intended behaviour?

What is the recommended way to “clean the internal state” so that I can reconnect to the remote Ray Serve instance?
I think “serve.context.set_global_client()” is not a public API.
I suppose “serve.shutdown()” should be used only if I want to shut down the remote server.
Could we have a new function to do this “cleaning”?

If you set the environment variable RAY_ADDRESS=ray://YOURHOST:10001 and call “ray.shutdown() + serve.shutdown()”, this will shut down your remote Ray Serve instance. Is this the intended behaviour?

See below for an example of the errors when using “serve.shutdown()”.

Thanks in advance!


Example of errors using “serve.shutdown()”.
(environment variable RAY_ADDRESS is not set)

In [1]: import ray

In [2]: from ray import serve

In [3]: ray.init(address="ray://172.17.0.2:10001")
Out[3]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)

In [4]: serve.list_deployments()
Out[4]: {}

In [5]: ray.shutdown()

In [6]: serve.shutdown()

In [7]: ray.is_initialized()
Out[7]: False

In [8]: ray.init(address="ray://172.17.0.2:10001")
Out[8]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)

### "serve.list_deployments() is not called now"

In [9]: ray.shutdown()

In [10]: serve.shutdown()
2022-07-10 02:41:05,598	INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265

### "Local cluster was created"

In [11]: ray.is_initialized()
Out[11]: True


In [12]: ray.init(address="ray://172.17.0.2:10001")
Out[12]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)

In [13]: serve.list_deployments()
Out[13]: {}

In [14]: ray.shutdown()

In [15]: serve.shutdown()
Exception: Ray Client is not connected. Please connect by calling `ray.init`.

In [16]: ray.is_initialized()
Out[16]: True

In [17]: ray.init(address="ray://172.17.0.2:10001")
Out[17]: ClientContext(dashboard_url='172.17.0.2:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7ff974cd07c0>)

In [18]: serve.list_deployments()
Caught schedule exception
Exception: Can't run an actor the server doesn't have a handle for

Hi @luisp, sorry you’re running into this, and thanks for the detailed investigation and reproduction steps! It’s not intended that serve.shutdown() create a new cluster; this sounds like a bug in Ray. Let me discuss with the team and get back to you.

I was able to reproduce this on Ray 1.13. I’ve filed an issue here: “[Serve] [Ray Client] `serve.shutdown()` starts new Ray cluster” (Issue #26527, ray-project/ray on GitHub).

Thanks for reporting this!


Hello @architkulkarni!

Thanks for your response! I will be watching the issue #26527.

I also decided to check the other Serve API functions. For instance, if I just call serve.list_deployments(), a local cluster is also created and I get the error “There is no instance running on this Ray cluster. Please call serve.start(detached=True) to start one.”

So I would like to suggest something else: Serve API functions should never create clusters. They should only connect to an existing cluster: a remote cluster if RAY_ADDRESS is set, and a local cluster otherwise.
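A minimal sketch of this suggestion, written as a standalone helper rather than as Ray code. The function name resolve_ray_address is hypothetical, and the sketch assumes that passing "auto" as the address to ray.init() attaches to an already running local cluster and fails if there is none, instead of starting a new one.

```python
import os
from typing import Mapping, Optional


def resolve_ray_address(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the address of an existing cluster, never a request to start one.

    Hypothetical helper: returns RAY_ADDRESS if it is set and non-empty,
    otherwise "auto" (assumed to mean "attach to an existing local cluster").
    """
    env = os.environ if env is None else env
    address = env.get("RAY_ADDRESS", "").strip()
    return address or "auto"


# Remote cluster when RAY_ADDRESS is set, existing local cluster otherwise.
print(resolve_ray_address({"RAY_ADDRESS": "ray://YOURHOST:10001"}))
print(resolve_ray_address({}))
```

A Serve API function following this idea would pass the resolved address to ray.init() and surface the connection error to the caller instead of silently creating a cluster.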

Once again, thanks!!
And congratulations on the great project!

Thanks for the feedback, I agree Serve API functions should generally not create a Ray cluster! Perhaps serve.start() should still create one, for simplicity. But I think for all other APIs, creating a Ray cluster is an unexpected side effect.