Confused about the behavior of "RAY_task_oom_retries" in Actor restarts

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

Hi all, while perusing ray/src/ray/common/ray_config_def.h at ray-2.3.0 (ray-project/ray on GitHub), I stumbled upon the environment flag RAY_task_oom_retries, which is set to -1 by default. I am confused about what this flag implies for actor restarts.

  1. The description seems to imply that when the memory monitor kills an actor's process due to OOM (so the actor dies), this does not count as a restart against max_restarts. Is that correct? If so, I don't think it is documented on the actor fault tolerance page or the actor page of the Ray docs; I was under the impression that any actor restart decrements the max_restarts counter.

@cemk OOM retries and max_retries are two separate counters. If a task fails due to OOM, it draws on the OOM counter rather than max_retries. So if the OOM counter is set to 1 and the task OOMs twice, the first OOM is retried and the second is not.
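
To make the two-counter idea concrete, here is a minimal sketch in plain Python. It is only a toy model of the behavior described above, not Ray's actual implementation: the RetryBudget class, its should_retry method, and the "oom" / "error" failure labels are hypothetical names made up for illustration, and treating -1 as "retry without limit" is an assumption based on how the default is described.

```python
from dataclasses import dataclass

INFINITE = -1  # assumption: -1 means "retry without limit"


@dataclass
class RetryBudget:
    """Toy model with two independent retry counters (hypothetical, not a Ray API)."""

    max_retries: int  # budget consumed by non-OOM failures
    oom_retries: int  # budget consumed only by OOM kills

    def should_retry(self, failure: str) -> bool:
        """Consume one retry from the matching counter and report whether a retry is allowed."""
        field = "oom_retries" if failure == "oom" else "max_retries"
        remaining = getattr(self, field)
        if remaining == INFINITE:
            return True  # unlimited budget is never decremented
        if remaining > 0:
            setattr(self, field, remaining - 1)
            return True
        return False  # this counter is exhausted


# OOM budget of 1: the first OOM is retried, the second is not.
budget = RetryBudget(max_retries=3, oom_retries=1)
first = budget.should_retry("oom")
second = budget.should_retry("oom")
print(first, second, budget.max_retries)
```

In this model the two OOM failures never touch max_retries, which is the point of the answer above: a non-OOM failure afterwards would still have its full budget.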