How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi,
I’m currently evaluating Ray Serve as an option for serving models 24/7 (in this case a TF model). As long as enough memory is available, Ray seems stable. Eventually, however, Ray runs out of memory and tries to restart the deployments at ever-shortening intervals, which leads to a death spiral in which the majority of REST requests fail.
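For context, here is a minimal sketch of the kind of deployment I’m running (class name, model path, and request handling are placeholders, not my actual code; the real setup also includes a Supervisor deployment):

```python
import tensorflow as tf
from ray import serve

@serve.deployment(num_replicas=1)
class MyModel:
    def __init__(self):
        # The TF model is loaded once per replica and kept in memory 24/7.
        self.model = tf.keras.models.load_model("/models/my_model")  # placeholder path

    async def __call__(self, request):
        payload = await request.json()
        return self.model.predict(payload["inputs"]).tolist()

serve.run(MyModel.bind())
```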
Here we can see that deployments fail faster and faster over time, until we end up in a loop of creation and destruction of deployments.
We can also see that Ray kills processes, which frees up memory until it is used up again by the new deployment. Since other Ray processes seem to use more memory over time, the deployments run out of memory faster and faster.
[See figure 2 below, as only one image is allowed per post.]
Since Ray’s internal OOM handling leads to an unstable state in which no requests can be served any more, I’m wondering what the recommendation is for a 24/7 setup. Is the only option to restart the cluster periodically, or should the OOM be handled by Docker (restarting the container)?
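One mitigation I’m experimenting with is declaring an explicit memory requirement per replica and tuning the health checks, roughly as sketched below (values are placeholders). As far as I understand, the "memory" entry in ray_actor_options is only a scheduling resource, not an enforced cap, so I’m not sure it can prevent the spiral:

```python
from ray import serve

@serve.deployment(
    num_replicas=1,
    # "memory" (in bytes) affects placement but, as far as I understand,
    # does not kill a replica that grows beyond it. Placeholder value.
    ray_actor_options={"memory": 1 * 1024**3},
    health_check_period_s=10,   # how often check_health() is called
    health_check_timeout_s=30,  # after which a replica is marked unhealthy
)
class MyModel:
    ...
```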
Processes which are killed:
106089 0.73GiB ray::ServeReplica:my_model
105970 0.19GiB ray::ServeReplica:Supervisor
One can also observe that memory increases for the replicas/deployments that are not killed, e.g. the Serve controller:
226 0.08GiB ray::ServeController.listen_for_change()
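To watch this growth, I sample per-process RSS with a small script along these lines (a simplified, hypothetical version; it prints roughly the same PID/MEM/COMMAND columns as the Ray logs below):

```python
import time
import psutil

def sample_ray_memory():
    # Print the resident set size of everything that looks like a Ray process.
    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        info = proc.info
        if not info["cmdline"] or info["memory_info"] is None:
            continue
        cmdline = " ".join(info["cmdline"])
        if "ray" in cmdline:
            rss_gib = info["memory_info"].rss / 1024**3
            print(f"{info['pid']} {rss_gib:.2f}GiB {cmdline[:60]}")

while True:
    sample_ray_memory()
    time.sleep(60)  # one sample per minute
```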
The corresponding health-check failures from controller_226.log:
INFO 2022-12-10 18:07:32,970 controller 226 deployment_state.py:548 - Health check for replica mymodel_model#KPWydA failed: ray::ServeReplica:mymodel_model.check_health() (pid=300, ip=172.17.0.2, repr=<ray.serve._private.replica.ServeReplica:mymodel_model object at 0x7fdbb8155ee0>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 6c7c1a870467 is used (1.52 / 1.6 GB). The top 10 memory consumers are:
PID MEM COMMAND
300 0.76GiB ray::ServeReplica:my_model
301 0.2GiB ray::ServeReplica:Supervisor
226 0.08GiB ray::ServeController.listen_for_change()
8 0.07GiB /usr/local/bin/python /usr/local/bin/ray start --head --block --dashboard-host 0.0.0.0
267 0.07GiB ray::HTTPProxyActor
58 0.06GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=0.
"controller_226.log" [readonly] 157532L, 12451357C
175 0.09GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-add
267 0.06GiB ray::HTTPProxyActor
10 0.06GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
31 0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py -
123 0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-d
8 0.02GiB /usr/local/bin/python /usr/local/bin/ray start --head --block --dashboard-host 0.0.0.0
In addition, up to 0.36 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---
Three days later, the Supervisor replica fails the same way:
INFO 2022-12-13 10:14:32,689 controller 226 deployment_state.py:548 - Health check for replica Supervisor#HXfqgf failed: ray::ServeReplica:Supervisor.check_health() (pid=105970, ip=172.17.0.2, repr=<ray.serve._private.replica.ServeReplica:Supervisor object at 0x7fb8c07b4ee0>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 6c7c1a870467 is used (1.58 / 1.6 GB). The top 10 memory consumers are:
PID MEM COMMAND
106089 0.73GiB ray::ServeReplica:my_model
105970 0.19GiB ray::ServeReplica:Supervisor
226 0.19GiB ray::ServeController.listen_for_change()
58 0.09GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=0.
175 0.09GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-add
267 0.06GiB ray::HTTPProxyActor
10 0.06GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
31 0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py -
123 0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-d
8 0.02GiB /usr/local/bin/python /usr/local/bin/ray start --head --block --dashboard-host 0.0.0.0
In addition, up to 0.36 GiB of shared memory is currently being used by the Ray object store.
Thanks for any guidance.
Edit: Added images