Worker killed - OOM

Hi there, I am using Ray 2.3.0 and following the suggestion from the documentation:

To disable worker killing, set the environment variable
`RAY_memory_monitor_refresh_ms` to zero.

As far as I understand, I have this configuration enabled. Here is the relevant line from the Ray environment agent log file:

2023-03-15 14:16:20,879 INFO -- Runtime env already created successfully. Env: {"RAY_PROFILING": 1, "RAY_memory_monitor_refresh_ms": 0, "RAY_task_events_report_interval_ms": 1000, "working_dir": "gcs://"}, context: {"command_prefix": ["cd", "/tmp/ray/session_2023-03-15_14-04-39_430545_8/runtime_resources/working_dir_files/_ray_pkg_732aaac1730758c4", "&&"], "env_vars": {"PYTHONPATH": "/tmp/ray/session_2023-03-15_14-04-39_430545_8/runtime_resources/working_dir_files/_ray_pkg_732aaac1730758c4"}, "py_executable": "/home/ray/anaconda3/bin/python", "resources_dir": null, "container": {}, "java_jars": []}

Why is my worker killed then?

(raylet, ip= [2023-03-15 14:07:28,435 E 34 34] (raylet) 5 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 4c7000ebda3ee5cb89b22c7924d1cb9c5715c05b35d44b093535c0b3, IP: over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip`

Hi @ckapoor, one thing to confirm: when you start Ray, do you see `MemoryMonitor disabled. Specify ...` in the raylet.out log? If so, the error message might be a red herring; for example, the worker may actually have been killed by the OS OOM killer.

Let us know what you find; depending on the result, we can fix it accordingly.
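
If it helps, here is a minimal sketch for scripting the check on raylet.out suggested above. It only looks for the `MemoryMonitor disabled` prefix quoted in this thread, since the rest of that message is elided here. The log path is an assumption (the usual default under `/tmp/ray/session_latest/logs/`), and the helper names `find_marker_lines` and `memory_monitor_disabled` are made up for illustration; they are not part of Ray.

```python
from pathlib import Path

# Assumption: default raylet log location on the node; adjust it if Ray was
# started with a different temp directory.
DEFAULT_RAYLET_LOG = Path("/tmp/ray/session_latest/logs/raylet.out")

# Only the prefix quoted in the thread; the remainder of the message is elided there.
MARKER = "MemoryMonitor disabled"


def find_marker_lines(lines, marker=MARKER):
    """Return every line (without trailing newline) that contains the marker."""
    return [line.rstrip("\n") for line in lines if marker in line]


def memory_monitor_disabled(log_path=DEFAULT_RAYLET_LOG):
    """Return True if the raylet log at log_path contains the marker text."""
    with open(log_path, "r", errors="replace") as f:
        return bool(find_marker_lines(f))


if __name__ == "__main__":
    if DEFAULT_RAYLET_LOG.exists():
        print("memory monitor disabled:", memory_monitor_disabled())
    else:
        print("raylet.out not found at", DEFAULT_RAYLET_LOG)
```

Run it on the node that reported the killed workers. If the marker is missing from that node's raylet.out, the memory monitor was probably still active in the raylet there.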