Handling worker OOM when running Ray in K8s

yiranwang52 · February 9, 2021, 10:02pm

Currently, if a Ray worker runs out of memory, a RayOutOfMemoryError is raised, and the Ray program is supposed to handle it; otherwise it could lead to unexpected behavior.
This is reasonable, but it is not the outcome a typical k8s user expects. K8s users are used to running single-process containers, and might expect the container to crash and k8s to restart the pod automatically, instead of having to handle OOM explicitly.
Is there a more k8s-native solution to this problem?

ericl · February 9, 2021, 10:19pm

Probably the easiest solution is to disable Ray's OOM handler when running in k8s. You can do this by setting the environment variable RAY_DEBUG_DISABLE_MEMORY_MONITOR=1 for the Ray processes.
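
As a rough sketch of what that could look like on k8s: the variable has to end up in the environment of the Ray containers, which in a pod spec means an entry in each container's `env` list. The snippet below works on a plain Python dict shaped like a pod spec, so it needs no Kubernetes client. The helper name `with_memory_monitor_disabled`, the container name, the image tag, and the memory limit are all made up for illustration; only the variable name comes from the reply above.

```python
import copy

# Env var name taken from the reply above; everything else is illustrative.
DISABLE_MONITOR_ENV = {"name": "RAY_DEBUG_DISABLE_MEMORY_MONITOR", "value": "1"}


def with_memory_monitor_disabled(pod_spec):
    """Return a copy of pod_spec with the env var set on every container."""
    spec = copy.deepcopy(pod_spec)
    for container in spec.get("containers", []):
        env = container.setdefault("env", [])
        # Drop any existing entry so the variable is defined exactly once.
        env[:] = [e for e in env if e.get("name") != DISABLE_MONITOR_ENV["name"]]
        env.append(dict(DISABLE_MONITOR_ENV))
    return spec


# Hypothetical worker pod spec; name, image and limit are placeholders.
worker_pod_spec = {
    "restartPolicy": "Always",
    "containers": [
        {
            "name": "ray-worker",
            "image": "rayproject/ray:latest",
            "resources": {"limits": {"memory": "4Gi"}},
        }
    ],
}

patched = with_memory_monitor_disabled(worker_pod_spec)
print(patched["containers"][0]["env"])
```

With Ray's own monitor out of the way, a container that exceeds its memory limit should be OOM-killed by the kernel and restarted according to the pod's `restartPolicy`, which is the behavior the question asks for. Whether that is acceptable for in-flight Ray tasks depends on the workload.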

This PR adds a tip to the error message on how to do this: "Add tip on how to disable Ray OOM handler" by ericl (Pull Request #14017, ray-project/ray on GitHub)
