I came across a new issue when I upgraded to Ray 1.7.
I am currently serving some ResNet models and using FastAPI (with a Uvicorn server) as the HTTP API interface to the models.
I start the “system” with `ray start --head` and `serve -n MY_NAMESPACE start`, and then run `uvicorn my.app`.
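For context, the app attaches to the already-running cluster and Serve instance on startup, roughly like this (a simplified sketch; the deployment details are omitted and the names are illustrative):

```python
import ray
from ray import serve
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
def connect_to_ray():
    # Attach to the cluster started by `ray start --head` and to the
    # detached Serve instance started by `serve -n MY_NAMESPACE start`.
    ray.init(address="auto", namespace="MY_NAMESPACE")
    serve.start(detached=True)
```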
Prior to 1.7, if I started the application this way, stopped the Uvicorn server, and then restarted it (say, for a code change, or when using the --reload flag), everything “worked”. After the Ray 1.7 upgrade, when I stop the Uvicorn server and restart it (the Ray and Serve processes are not stopped), I get an error as follows:
ray::ServeController.deploy() (pid=3633, ip=192.168.1.21, repr=<ray.serve.controller.ServeController object at 0x1a9e64190>)
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/controller.py", line 261, in deploy
    goal_id, updating = self.backend_state_manager.deploy_backend(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 1320, in deploy_backend
    return self._backend_states[backend_tag].deploy(backend_info)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 756, in deploy
    self._save_checkpoint_func()
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/backend_state.py", line 1257, in _save_checkpoint_func
    self._kv_store.put(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/serve/storage/kv_store.py", line 51, in put
    ray_kv._internal_kv_put(self.get_storage_key(key), val, overwrite=True)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/ray/experimental/internal_kv.py", line 86, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/client.py", line 900, in execute_command
    conn.send_command(*args)
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/connection.py", line 725, in send_command
    self.send_packed_command(self.pack_command(*args),
  File "/Users/me/.virtualenvs/myapp/lib/python3.9/site-packages/redis/connection.py", line 717, in send_packed_command
    raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 41 while writing to socket. Protocol wrong type for socket.
Once I get this error, I need to run `serve shutdown` and `ray stop`, and then restart everything to get the whole application running again.
I'm not sure whether this is a bug or whether I was just never cleaning up my models properly, which got me thinking: what is the “right” way to handle application reloads/restarts?
When I close my application, should I delete all the Serve deployments and redeploy on each application start? That seems fine for a single-node app, but if you have multiple nodes running and restart one, it doesn’t seem like the model deployments should be removed.
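To make that concrete, the two patterns I'm weighing look roughly like this (a sketch against the Ray 1.x deployment API, assuming a running Serve instance; the deployment name and class are illustrative):

```python
from ray import serve

@serve.deployment(name="resnet")
class ResNetModel:
    def __call__(self, request):
        ...  # run inference here

# Option A: just (re)deploy on every app start. deploy() updates an
# existing deployment with the same name in place.
ResNetModel.deploy()

# Option B: explicitly tear down on app shutdown, then redeploy on the
# next start. This seems wrong for multi-node setups, since stopping one
# node's app would remove deployments still serving the other nodes.
serve.get_deployment("resnet").delete()
```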
I'm wondering what the development guidance for this scenario is. Thanks!