False-positive OOM detection results in Ray's OOM killer terminating Ray Tune trials

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I got the following error while using Ray Tune for hyperparameter tuning with 3 concurrent workers.

Memory on the node (IP: 192.168.2.110, ID: bd9319d3721cf0dcb7ad3a969a4d0667f755174de05699073397c448) where the task (actor ID: dbd6bbe3ab37410d9aaba2a001000000, name=ImplicitFunc.__init__, pid=5633, memory used=8.03GB) was running was 28.59GB / 68.26GB (0.418875), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 62224722ee47fa7b8c078e0aea7559ec6c79081bd7b8df6f4d3faa07) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.2.110`. To see the logs of the worker, use `ray logs worker-62224722ee47fa7b8c078e0aea7559ec6c79081bd7b8df6f4d3faa07*out -ip 192.168.2.110`. Top 10 memory users:
PID     MEM(GB) COMMAND                                                                                                                                                                                                                                                
5633    8.03    ray::ImplicitFunc.train
5453    6.56    ray::ImplicitFunc.train
5536    6.12    ray::ImplicitFunc.train
# ... irrelevant processes with low memory usage

The strangest thing is that there was plenty of free memory, but Ray asserted that the memory usage (41.8875%) exceeded the 95% threshold and killed the jobs…

It seems I can avoid this by reducing the number of parallel jobs, at the cost of a longer total running time. But that doesn't make sense anyway, and I want to know how to prevent this faulty behavior.
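For reference, here is a minimal sketch of this kind of setup (placeholder objective function, not the actual training code) showing the two knobs that could serve as a workaround. The environment variables are Ray's documented memory-monitor settings, not anything specific to this bug:

```python
import os

# Possible workaround: relax or disable Ray's memory monitor before the cluster
# starts. These are the documented memory-monitor settings; on a multi-node
# cluster they must be set in the environment of each `ray start` command.
os.environ["RAY_memory_usage_threshold"] = "0.99"    # default is 0.95
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables the monitor

import ray
from ray import tune


def objective(config):
    # Placeholder for the real training loop; returns a final metric dict.
    return {"score": config["lr"]}


ray.init()

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        num_samples=12,
        max_concurrent_trials=3,  # reducing this is the workaround mentioned above
    ),
)
tuner.fit()
```

Raising or disabling the memory-monitor threshold only papers over the false positive, so understanding the root cause is still the goal.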

Hey @mk6, which version of Ray are you running, and do you have a simple repro script that I could run?

@mk6 There's a user on Slack who's running into a similar issue and has some suggestions to confirm that it's the same problem: Slack
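The Slack suggestions aren't reproduced here, but one generic way to check whether a node is even susceptible (it has to be running cgroup v1, per the fix linked below) is:

```python
# Generic check (not necessarily what the Slack thread suggests): determine
# whether this node uses cgroup v1 or v2 for memory accounting.
import os

if os.path.exists("/sys/fs/cgroup/cgroup.controllers"):
    print("cgroup v2 (unified hierarchy)")
elif os.path.exists("/sys/fs/cgroup/memory"):
    print("cgroup v1 memory controller mounted at /sys/fs/cgroup/memory")
else:
    print("no memory cgroup controller found")
```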

I think this is a known issue that we are actively fixing: [core] Critical Bug-fix: Fix ray memory metric calculation in CGroup V1 environment by WeichenXu123 · Pull Request #42508 · ray-project/ray · GitHub
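For readers hitting this in the meantime, the sketch below is a generic illustration of a cgroup v1 pitfall that produces exactly this kind of false positive (counting reclaimable page cache as used memory). It is not the actual Ray code and not necessarily the precise calculation fixed in the PR above:

```python
# Illustrative only: compare a naive cgroup v1 "used memory" reading with a
# working-set estimate that subtracts reclaimable page cache.
CGROUP_V1_MEM = "/sys/fs/cgroup/memory"


def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())


def read_stat(path: str) -> dict:
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats


usage = read_int(f"{CGROUP_V1_MEM}/memory.usage_in_bytes")
# Note: an unset v1 limit reads as a huge sentinel value (~9.2e18 bytes).
limit = read_int(f"{CGROUP_V1_MEM}/memory.limit_in_bytes")
stat = read_stat(f"{CGROUP_V1_MEM}/memory.stat")

naive_fraction = usage / limit  # includes page cache, can look inflated
working_set_fraction = (usage - stat["total_inactive_file"]) / limit

print(f"naive usage fraction:       {naive_fraction:.4f}")
print(f"working-set usage fraction: {working_set_fraction:.4f}")
```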