24/7 Soak test result: Ray can't recover from OOM errors

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,

I’m currently evaluating Ray Serve as an option for serving models 24/7 (here a TF model). As long as enough memory is available, Ray appears to be stable. Eventually, however, Ray runs out of memory and tries to restart the deployments at ever-shortening intervals, which leads to a death spiral in which the majority of REST requests fail.

Here we can see that the deployment fails faster and faster over time, until we end up in a loop of deployments being created and destroyed.

We can also see that Ray kills processes, which frees up memory until it is used up again by the new deployment. Since other Ray processes seem to use more memory over time, the deployments run out of memory faster each time.

[See figure 2 below, as only one image is allowed per post.]

Since Ray’s internal OOM handling leads to an unstable state in which no requests can be served any more, I’m wondering what the recommendation is for a 24/7 setup. Is the only option to restart the cluster periodically, or should the OOM be handled by Docker (restarting the container)?

Processes which are killed:

106089  0.73GiB ray::ServeReplica:my_model
105970  0.19GiB ray::ServeReplica:Supervisor

One can also observe that memory increases for the replicas/deployments which are not killed:

E.g. 226 0.08GiB ray::ServeController.listen_for_change()

INFO 2022-12-10 18:07:32,970 controller 226 deployment_state.py:548 - Health check for replica mymodel_model#KPWydA failed: ray::ServeReplica:mymodel_model.check_health() (pid=300, ip=172.17.0.2, repr=<ray.serve._private.replica.ServeReplica:mymodel_model object at 0x7fdbb8155ee0>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 6c7c1a870467 is used (1.52 / 1.6 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
300     0.76GiB ray::ServeReplica:my_model
301     0.2GiB  ray::ServeReplica:Supervisor
226     0.08GiB ray::ServeController.listen_for_change()
8       0.07GiB /usr/local/bin/python /usr/local/bin/ray start --head --block --dashboard-host 0.0.0.0
267     0.07GiB ray::HTTPProxyActor
58      0.06GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=0.
"controller_226.log" [readonly] 157532L, 12451357C

In addition, up to 0.36 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---
INFO 2022-12-13 10:14:32,689 controller 226 deployment_state.py:548 - Health check for replica Supervisor#HXfqgf failed: ray::ServeReplica:Supervisor.check_health() (pid=105970, ip=172.17.0.2, repr=<ray.serve._private.replica.ServeReplica:Supervisor object at 0x7fb8c07b4ee0>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 6c7c1a870467 is used (1.58 / 1.6 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
106089  0.73GiB ray::ServeReplica:my_model
105970  0.19GiB ray::ServeReplica:Supervisor
226     0.19GiB ray::ServeController.listen_for_change()
58      0.09GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=0.
175     0.09GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-add
267     0.06GiB ray::HTTPProxyActor
10      0.06GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
31      0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py -
123     0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-d
8       0.02GiB /usr/local/bin/python /usr/local/bin/ray start --head --block --dashboard-host 0.0.0.0

In addition, up to 0.36 GiB of shared memory is currently being used by the Ray object store.

Thanks for any guidance.

Edit: Added images

Hi @al.exe, could you try re-uploading your images? I can’t see them on my end.

Figure 2 (Memory usage):

Thanks @al.exe, would you happen to have a reproduction script?

Hi,

I can’t boil it down to a simple script at the moment, but I believe one can use the default example with FastAPI + a TensorFlow model to reproduce the behavior (I will verify it when I have some time).

However, on a conceptual level, one can summarize it as follows:

When we start the head node in a memory-limited container we might have the following three processes:

0 HTTPProxyActor
1 ServeReplica:MyModel
2 ServeController

Over time, MyModel’s memory usage increases and the replica is eventually restarted.
However, other processes such as the ServeController also grow in memory, and since the overall available memory shrinks, the restarted MyModel replica runs out of memory sooner each time.
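
For reference, here is a minimal sketch of the kind of setup I am soak testing, assuming Ray 2.1-era Serve APIs; MyModel and the dummy predict handler are placeholders for the real TF model code:

import numpy as np
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class MyModel:
    def __init__(self):
        # The real test loads a TF model here; its process memory grows
        # slowly over many requests.
        self.weights = np.zeros((1000, 1000))

    @app.get("/predict")
    def predict(self) -> dict:
        return {"result": float(self.weights.sum())}

serve.run(MyModel.bind())

A load generator then hits http://<head-ip>:8000/predict around the clock inside the memory-limited container.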

One solution could be to restart the ServeController as well (e.g. because it might contain a memory leak); with the default logic, though, only the process with the highest memory consumption is restarted. In general, Ray processes that might leak memory may need to be restarted too, since otherwise not enough memory is left for the other processes, e.g. the replica which serves the model.

I used Ray 2.1 (I also tried the alpha memory monitor, but that led to the head node stopping after a few worker failures).

Hi @al.exe, one option is to isolate the issue here. The ServeController process should not leak memory, and we have tests to ensure that. Its memory should grow to a steady state and then remain stable. If that’s not the case, we will always treat it as a high-priority issue.

For this scenario, one actionable tip is to make sure the head container only runs the Serve controller and no models. You can run the models in worker containers instead. In the head container, start Ray with ray start --head --num-cpus=0 so that no models will be provisioned there. If a worker container gets killed, Ray can recover gracefully from that. If the head container gets killed, the same issue will arise, but at least we will know it is actually the controller that is OOMing.
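
As a rough sketch of that split (the commands and options below are illustrative, not taken from your environment):

# Head container: advertises 0 CPUs, so only the Serve controller and other
# zero-CPU system actors can live here.
#   ray start --head --num-cpus=0 --dashboard-host 0.0.0.0
# Worker container(s): join the head and provide the CPUs.
#   ray start --address=<head-ip>:6379
import ray
from ray import serve

ray.init(address="auto")

@serve.deployment(ray_actor_options={"num_cpus": 1})
class MyModel:
    def __call__(self, request):
        return "ok"

# Each replica requests 1 CPU; since the head advertises none, replicas are
# scheduled only on worker containers. If a worker is OOM-killed, the
# controller on the head can bring the replica back elsewhere.
serve.run(MyModel.bind())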

By the way, can you elaborate on your test setup? Is it on K8s? VM hosts?

Thanks for your input. I’m testing on a VM setup without K8s (K8s can’t be used in my environment).

I experimented a bit with custom labels/resources, and it was possible to separate the ServeController from the ServeReplicas:
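
Roughly, this is what it looks like, assuming a custom resource (here called model_node) that only the worker containers advertise when they join; the name and values are just for illustration:

# Worker containers join with something like:
#   ray start --address=<head-ip>:6379 --resources='{"model_node": 1}'
# The head container is started without that custom resource, so any
# deployment that requests it cannot be placed next to the ServeController.
from ray import serve

@serve.deployment(ray_actor_options={"resources": {"model_node": 0.1}})
class MyModel:
    def __call__(self, request):
        return "ok"

serve.run(MyModel.bind())
# The Supervisor deployment gets the same ray_actor_options, so both kinds of
# replicas end up on the worker containers while the controller stays on the
# head node.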

I will check whether this leads to a more stable deployment, especially for the OOM cases.