Proper way to shut down and restart Ray Serve and deployments

I came across a new issue when I upgraded to Ray 1.7.

I am currently serving some ResNet models and using FastAPI (with a Uvicorn server) as the HTTP API interface to the models.

I start the “system” with `ray start --head`, then `serve -n MY_NAMESPACE start`, and then `uvicorn my.app`.

Prior to 1.7, if I started the application this way, stopped the uvicorn server, and then restarted uvicorn (say, for a code change, or when using the --reload flag), everything “worked”. After the Ray 1.7 upgrade, when I stop the uvicorn server and restart it (the Ray and Serve processes are not stopped), I get an error like the following:

```
ray::ServeController.deploy() (pid=3633, ip=192.168.1.21, repr=<ray.serve.controller.ServeController object at 0x1a9e64190>)
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/controller.py", line 261, in deploy
    goal_id, updating = self.backend_state_manager.deploy_backend(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 1320, in deploy_backend
    return self._backend_states[backend_tag].deploy(backend_info)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 756, in deploy
    self._save_checkpoint_func()
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 1257, in _save_checkpoint_func
    self._kv_store.put(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/storage/kv_store.py", line 51, in put
    ray_kv._internal_kv_put(self.get_storage_key(key), val, overwrite=True)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/experimental/internal_kv.py", line 86, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/client.py", line 900, in execute_command
    conn.send_command(*args)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/connection.py", line 725, in send_command
    self.send_packed_command(self.pack_command(*args),
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/connection.py", line 717, in send_packed_command
    raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 41 while writing to socket. Protocol wrong type for socket.
```

Once I get this error, I need to run `serve shutdown` and `ray stop` and then restart everything to get the whole application running again.

I’m not sure if this is a bug or if I was just never cleaning up my models properly, which got me thinking: what is the “right” way to handle application reloads/restarts?

When I close my application, should I delete all the Serve deployments and redeploy them on each application start? That seems fine for a single-node app, but if you have multiple nodes running and restart one of them, it doesn’t seem like the model deployments should be removed.
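
To be concrete, the “delete and redeploy” option I have in mind is something like this (just a sketch, assuming the Ray 1.x deployment API with serve.list_deployments() and Deployment.delete(); the helper name is a placeholder):

```python
from ray import serve

def teardown_all_deployments():
    # On app shutdown, delete every deployment registered with Serve so the
    # next application start begins from a clean slate and simply re-runs
    # the normal deploy code.
    for deployment in serve.list_deployments().values():
        deployment.delete()
```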

I’m wondering what the recommended development guidance is for this scenario… thanks.

Update: I figured out the cause of the error above. I needed to add a version (e.g., version="v1") to my deployments so that each redeploy wasn’t treated as a new deployment.
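
For anyone hitting the same thing, the change was roughly this (a trimmed-down sketch; the deployment name and class body are placeholders, not my actual model code):

```python
from ray import serve

# Pinning an explicit version tells the Serve controller that this is the
# *same* deployment across application restarts, so re-deploying it maps to
# the existing replicas instead of being treated as a brand-new deployment.
@serve.deployment(name="resnet_classifier", version="v1", num_replicas=1)
class ResNetClassifier:
    def __init__(self):
        self.model = None  # placeholder for loading the real ResNet weights

    async def __call__(self, request):
        return {"prediction": "placeholder"}

ResNetClassifier.deploy()
```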

That being said, if anyone else is doing single-node lifecycle management of their models (with the goal of eventually moving to a cluster), I’d be interested in hearing your approaches.

Hi @puntime_error, from the stack trace it looks like you’re essentially re-deploying your application on the same Ray cluster, with the Ray and Serve processes (including the Serve controller) still running.

In this case, the Serve controller keeps your deployment intent (code definition, version, num_replicas, etc.) and keeps trying to reach your target state, where the version hash is determined by a combination of your config and the version you provide: version.py - ray-project/ray - Sourcegraph

For a simple shutdown and restart, calling serve.start() and serve.shutdown() is sufficient. But in your case you’re essentially re-deploying into the same cluster, so the replica version hash that the current Serve controller perceives matters for correctly mapping each deploy to either new or existing replicas.
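
If it helps, the basic lifecycle looks roughly like this (a sketch assuming the Ray 1.x API; the namespace is just the one from your example):

```python
import ray
from ray import serve

# Connect to the cluster that `ray start --head` is already running.
ray.init(address="auto", namespace="MY_NAMESPACE")

# detached=True keeps the Serve controller and replicas alive after this
# driver process exits; if Serve is already running in the namespace,
# serve.start() just connects to the existing instance.
serve.start(detached=True)

# ... deploy / re-deploy your versioned deployments here ...

# Tear everything down (all deployments plus the controller) only when you
# want a full reset, e.g. right before `ray stop`:
serve.shutdown()
```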