Proper way to shut down and restart Ray Serve and deployments

I came across a new issue when I upgraded to Ray 1.7.

I am currently serving some ResNet models and using FastAPI (with a Uvicorn server) as the HTTP API interface to the models.

I start the “system” with `ray start --head`, then `serve -n MY_NAMESPACE start`, and then `uvicorn my.app`.

Prior to 1.7, if I started the application this way, stopped the uvicorn server, and then restarted uvicorn (say, for a code change, or when using the --reload flag), everything “worked”. After the Ray 1.7 upgrade, when I stop the uvicorn server and restart it (the Ray and Serve processes are not stopped), I get an error like the following:

```
ray::ServeController.deploy() (pid=3633, ip=192.168.1.21, repr=<ray.serve.controller.ServeController object at 0x1a9e64190>)
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/controller.py", line 261, in deploy
    goal_id, updating = self.backend_state_manager.deploy_backend(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 1320, in deploy_backend
    return self._backend_states[backend_tag].deploy(backend_info)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 756, in deploy
    self._save_checkpoint_func()
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 1257, in _save_checkpoint_func
    self._kv_store.put(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/storage/kv_store.py", line 51, in put
    ray_kv._internal_kv_put(self.get_storage_key(key), val, overwrite=True)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/experimental/internal_kv.py", line 86, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/client.py", line 900, in execute_command
    conn.send_command(*args)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/connection.py", line 725, in send_command
    self.send_packed_command(self.pack_command(*args),
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/connection.py", line 717, in send_packed_command
    raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 41 while writing to socket. Protocol wrong type for socket.
```

Once I get this error, I need to run `serve shutdown` and `ray stop` and then restart everything to get the whole application running again.

I’m not sure if this is a bug or if I was just never cleaning up my models properly, which got me thinking: what is the “right” way to handle application reloads/restarts?

When I close my application, should I delete all the Serve deployments and redeploy them on each application start? That seems fine for a single-node app, but if you have multiple nodes running and restart one of them, it doesn’t seem like the model deployments should be removed.
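
To be concrete, the “delete and redeploy” option I have in mind is something like this (just a sketch, assuming the Ray 1.x deployment API with serve.list_deployments() and Deployment.delete(); the helper name is a placeholder):

```python
from ray import serve

def teardown_all_deployments():
    # On app shutdown, delete every deployment registered with Serve so the
    # next application start begins from a clean slate and simply re-runs
    # the normal deploy code.
    for deployment in serve.list_deployments().values():
        deployment.delete()
```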

I’m wondering what the recommended development guidance is for this scenario… thanks.

Update: I figured out the cause of the error above. I needed to add a version (e.g., version="v1") to my deployments so that each redeploy wasn’t treated as a new deployment.
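
For anyone hitting the same thing, the change was roughly this (a trimmed-down sketch; the deployment name and class body are placeholders, not my actual model code):

```python
from ray import serve

# Pinning an explicit version tells the Serve controller that this is the
# *same* deployment across application restarts, so re-deploying it maps to
# the existing replicas instead of being treated as a brand-new deployment.
@serve.deployment(name="resnet_classifier", version="v1", num_replicas=1)
class ResNetClassifier:
    def __init__(self):
        self.model = None  # placeholder for loading the real ResNet weights

    async def __call__(self, request):
        return {"prediction": "placeholder"}

ResNetClassifier.deploy()
```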

That being said, if anyone else is doing single-node lifecycle management of their models (with the goal of eventually moving to a cluster), I’d be interested in hearing your approaches.

Hi @puntime_error, from the stack trace it looks like you’re essentially re-deploying your application on the same Ray cluster, with the Ray and Serve processes (including the Serve controller) still running.

In this case, the Serve controller keeps your deployment intent (code definition, version, num_replicas, etc.) and keeps trying to reach your target state, where the version hash is determined by a combination of your config and the version you provide: version.py - ray-project/ray - Sourcegraph

For a simple shutdown and restart, calling serve.start() and serve.shutdown() is sufficient. But in your case you’re essentially re-deploying into the same cluster, so the replica version hash that the current Serve controller perceives matters for correctly mapping each deploy to either new or existing replicas.
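
If it helps, the basic lifecycle looks roughly like this (a sketch assuming the Ray 1.x API; the namespace is just the one from your example):

```python
import ray
from ray import serve

# Connect to the cluster that `ray start --head` is already running.
ray.init(address="auto", namespace="MY_NAMESPACE")

# detached=True keeps the Serve controller and replicas alive after this
# driver process exits; if Serve is already running in the namespace,
# serve.start() just connects to the existing instance.
serve.start(detached=True)

# ... deploy / re-deploy your versioned deployments here ...

# Tear everything down (all deployments plus the controller) only when you
# want a full reset, e.g. right before `ray stop`:
serve.shutdown()
```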