24/7 Soak test result: Ray can't recover from OOM errors

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,

I’m currently evaluating Ray Serve as an option for serving models 24/7 (here a TF model). As long as enough memory is available, Ray appears to be stable. Eventually, however, Ray runs out of memory and tries to restart the deployments at ever-shortening intervals, which leads to a death spiral in which the majority of REST requests fail.

Here we can see that the deployment fails faster and faster over time, until we end up in a loop of deployments being created and destroyed.

We can also see that Ray kills processes, which frees up memory until it is used up again by the new deployment. Since other Ray processes seem to use more memory over time, the deployments run out of memory faster each time.

[See figure 2 below, as only one image is allowed per post.]

Since Ray’s internal OOM handling leads to an unstable state in which no requests can be served any more, I’m wondering what the recommendation is for a 24/7 setup. Is the only option to restart the cluster periodically, or should the OOM be handled by Docker (restarting the container)?

Processes which are killed:

106089  0.73GiB ray::ServeReplica:my_model
105970  0.19GiB ray::ServeReplica:Supervisor

One can also observe that memory increases for the replicas/deployments which are not killed:

E.g. 226 0.08GiB ray::ServeController.listen_for_change()

INFO 2022-12-10 18:07:32,970 controller 226 deployment_state.py:548 - Health check for replica mymodel_model#KPWydA failed: ray::ServeReplica:mymodel_model.check_health() (pid=300, ip=172.17.0.2, repr=<ray.serve._private.replica.ServeReplica:mymodel_model object at 0x7fdbb8155ee0>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 6c7c1a870467 is used (1.52 / 1.6 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
300     0.76GiB ray::ServeReplica:my_model
301     0.2GiB  ray::ServeReplica:Supervisor
226     0.08GiB ray::ServeController.listen_for_change()
8       0.07GiB /usr/local/bin/python /usr/local/bin/ray start --head --block --dashboard-host 0.0.0.0
267     0.07GiB ray::HTTPProxyActor
58      0.06GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=0.
"controller_226.log" [readonly] 157532L, 12451357C

In addition, up to 0.36 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---
INFO 2022-12-13 10:14:32,689 controller 226 deployment_state.py:548 - Health check for replica Supervisor#HXfqgf failed: ray::ServeReplica:Supervisor.check_health() (pid=105970, ip=172.17.0.2, repr=<ray.serve._private.replica.ServeReplica:Supervisor object at 0x7fb8c07b4ee0>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 6c7c1a870467 is used (1.58 / 1.6 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
106089  0.73GiB ray::ServeReplica:my_model
105970  0.19GiB ray::ServeReplica:Supervisor
226     0.19GiB ray::ServeController.listen_for_change()
58      0.09GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=0.
175     0.09GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-add
267     0.06GiB ray::HTTPProxyActor
10      0.06GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
31      0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py -
123     0.03GiB /usr/local/bin/python -u /usr/local/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-d
8       0.02GiB /usr/local/bin/python /usr/local/bin/ray start --head --block --dashboard-host 0.0.0.0

In addition, up to 0.36 GiB of shared memory is currently being used by the Ray object store.

Thanks for any guidance.

Edit: Added images

Hi @al.exe, could you try re-uploading your images? I can’t see them on my end.

Figure 2 (Memory usage):

Thanks @al.exe, would you happen to have a reproduction script?

Hi,

I can’t boil it down to a simple script at the moment, but I believe one can use the default example with FastAPI + a TensorFlow model to reproduce the behavior (I will verify it when I have some time).

However, on a conceptual level, one can summarize it as follows:

When we start the head node in a memory-limited container we might have the following three processes:

0 HTTPProxyActor
1 ServeReplica:MyModel
2 ServeController

Over time, MyModel’s memory usage increases and the replica is eventually restarted.
However, other processes such as the ServeController also grow in memory, and since the overall available memory shrinks, the restarted MyModel replica runs out of memory sooner each time.
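
For reference, here is a minimal sketch of the kind of setup I am soak testing, assuming Ray 2.1-era Serve APIs; MyModel and the dummy predict handler are placeholders for the real TF model code:

import numpy as np
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class MyModel:
    def __init__(self):
        # The real test loads a TF model here; its process memory grows
        # slowly over many requests.
        self.weights = np.zeros((1000, 1000))

    @app.get("/predict")
    def predict(self) -> dict:
        return {"result": float(self.weights.sum())}

serve.run(MyModel.bind())

A load generator then hits http://<head-ip>:8000/predict around the clock inside the memory-limited container.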

One solution could be to restart the ServeController as well (e.g. because it might contain a memory leak); with the default logic, though, only the process with the highest memory consumption is restarted. In general, Ray processes that might leak memory may need to be restarted too, since otherwise not enough memory is left for the other processes, e.g. the replica which serves the model.

I used Ray 2.1 (I also tried the alpha memory monitor, but that led to the head node stopping after a few worker failures).

Hi @al.exe, one option is to isolate the issue here. The ServeController process should not leak memory, and we have tests to ensure that. Its memory should grow to a steady state and then remain stable. If that’s not the case, we will always treat it as a high-priority issue.

For this scenario, one actionable tip is to make sure the head container only runs the Serve controller and no models. You can run the models in worker containers instead. In the head container, start Ray with ray start --head --num-cpus=0 so that no models will be provisioned there. If a worker container gets killed, Ray can recover gracefully from that. If the head container gets killed, the same issue will arise, but at least we will know it is actually the controller that is OOMing.
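
As a rough sketch of that split (the commands and options below are illustrative, not taken from your environment):

# Head container: advertises 0 CPUs, so only the Serve controller and other
# zero-CPU system actors can live here.
#   ray start --head --num-cpus=0 --dashboard-host 0.0.0.0
# Worker container(s): join the head and provide the CPUs.
#   ray start --address=<head-ip>:6379
import ray
from ray import serve

ray.init(address="auto")

@serve.deployment(ray_actor_options={"num_cpus": 1})
class MyModel:
    def __call__(self, request):
        return "ok"

# Each replica requests 1 CPU; since the head advertises none, replicas are
# scheduled only on worker containers. If a worker is OOM-killed, the
# controller on the head can bring the replica back elsewhere.
serve.run(MyModel.bind())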

By the way, can you elaborate on your test setup? Is it on K8s? VM hosts?

Thanks for your input. I’m testing on a VM setup without K8s (K8s can’t be used in my environment).

I experimented a bit with custom labels/resources, and it was possible to separate the ServeController from the ServeReplicas:
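
Roughly, this is what it looks like, assuming a custom resource (here called model_node) that only the worker containers advertise when they join; the name and values are just for illustration:

# Worker containers join with something like:
#   ray start --address=<head-ip>:6379 --resources='{"model_node": 1}'
# The head container is started without that custom resource, so any
# deployment that requests it cannot be placed next to the ServeController.
from ray import serve

@serve.deployment(ray_actor_options={"resources": {"model_node": 0.1}})
class MyModel:
    def __call__(self, request):
        return "ok"

serve.run(MyModel.bind())
# The Supervisor deployment gets the same ray_actor_options, so both kinds of
# replicas end up on the worker containers while the controller stays on the
# head node.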

I will check whether this leads to a more stable deployment, especially for the OOM cases.