Handling worker OOM when running Ray in K8s

Currently, if a Ray worker runs out of memory, a RayOutOfMemoryError is raised, and the Ray program is supposed to handle it; otherwise it can lead to unexpected behavior.
This is reasonable, but it is not the outcome a typical k8s user expects. K8s users are used to running single-process containers, and might expect the container to crash and k8s to restart the pod automatically, instead of having to handle OOM explicitly.
Is there a more k8s-native solution to this problem?

Probably the easiest solution is to disable Ray's OOM handler when running in k8s. You can do this by setting the environment variable RAY_DEBUG_DISABLE_MEMORY_MONITOR=1.
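
As a rough sketch of what that could look like, the snippet below adds the variable to a Kubernetes container spec represented as a plain Python dict (for example, one loaded from a pod manifest). The helper name `with_memory_monitor_disabled`, the container name, and the image are placeholders for illustration, not part of Ray or Kubernetes; only the environment variable name comes from the answer above, and the `env` list of `name`/`value` entries follows the standard container spec layout.

```python
import copy

# Environment variable named in the answer above.
DISABLE_FLAG = "RAY_DEBUG_DISABLE_MEMORY_MONITOR"


def with_memory_monitor_disabled(container: dict) -> dict:
    """Return a copy of a container spec with the flag set to "1".

    Hypothetical helper: the input dict is left unmodified.
    """
    patched = copy.deepcopy(container)
    # Drop any existing entry for the flag so it is not duplicated.
    env = [e for e in patched.get("env", []) if e.get("name") != DISABLE_FLAG]
    env.append({"name": DISABLE_FLAG, "value": "1"})
    patched["env"] = env
    return patched


# Hypothetical worker container; name and image are placeholders.
worker = {
    "name": "ray-worker",
    "image": "rayproject/ray",
    "env": [{"name": "FOO", "value": "bar"}],
}
patched_worker = with_memory_monitor_disabled(worker)
```

The same name/value pair can be written directly under `env:` in the worker container of the pod manifest; the helper is only useful if the spec is generated programmatically.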

This PR adds a tip to the error message on how to do this: "Add tip on how to disable Ray OOM handler" by ericl (ray-project/ray, Pull Request #14017).
