False-positive OOM detection results in Ray's OOM killer terminating Ray Tune trials

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I got the following error while using Ray Tune for hyperparameter tuning with 3 concurrent workers.

Memory on the node (IP: 192.168.2.110, ID: bd9319d3721cf0dcb7ad3a969a4d0667f755174de05699073397c448) where the task (actor ID: dbd6bbe3ab37410d9aaba2a001000000, name=ImplicitFunc.__init__, pid=5633, memory used=8.03GB) was running was 28.59GB / 68.26GB (0.418875), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 62224722ee47fa7b8c078e0aea7559ec6c79081bd7b8df6f4d3faa07) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.2.110`. To see the logs of the worker, use `ray logs worker-62224722ee47fa7b8c078e0aea7559ec6c79081bd7b8df6f4d3faa07*out -ip 192.168.2.110`. Top 10 memory users:
PID     MEM(GB) COMMAND                                                                                                                                                                                                                                                
5633    8.03    ray::ImplicitFunc.train
5453    6.56    ray::ImplicitFunc.train
5536    6.12    ray::ImplicitFunc.train
# ... irrelevant processes with low memory usage

The strangest thing is that there was plenty of free memory, but Ray asserted that the memory usage (41.8875%) exceeded the 95% threshold and killed the jobs…

It seems I can avoid this by reducing the number of parallel jobs, at the cost of a longer total running time. But that doesn't make sense anyway, and I want to know how to prevent this faulty behavior.
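For reference, here is a minimal sketch of this kind of setup (placeholder objective function, not the actual training code) showing the two knobs that could serve as a workaround. The environment variables are Ray's documented memory-monitor settings, not anything specific to this bug:

```python
import os

# Possible workaround: relax or disable Ray's memory monitor before the cluster
# starts. These are the documented memory-monitor settings; on a multi-node
# cluster they must be set in the environment of each `ray start` command.
os.environ["RAY_memory_usage_threshold"] = "0.99"    # default is 0.95
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables the monitor

import ray
from ray import tune


def objective(config):
    # Placeholder for the real training loop; returns a final metric dict.
    return {"score": config["lr"]}


ray.init()

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        num_samples=12,
        max_concurrent_trials=3,  # reducing this is the workaround mentioned above
    ),
)
tuner.fit()
```

Raising or disabling the memory-monitor threshold only papers over the false positive, so understanding the root cause is still the goal.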

Hey @mk6, which version of Ray are you running, and do you have a simple repro script that I could run?

@mk6 There's a user on Slack who's running into a similar issue and has some suggestions to confirm that it's the same problem: Slack
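The Slack suggestions aren't reproduced here, but one generic way to check whether a node is even susceptible (it has to be running cgroup v1, per the fix linked below) is:

```python
# Generic check (not necessarily what the Slack thread suggests): determine
# whether this node uses cgroup v1 or v2 for memory accounting.
import os

if os.path.exists("/sys/fs/cgroup/cgroup.controllers"):
    print("cgroup v2 (unified hierarchy)")
elif os.path.exists("/sys/fs/cgroup/memory"):
    print("cgroup v1 memory controller mounted at /sys/fs/cgroup/memory")
else:
    print("no memory cgroup controller found")
```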

I think this is a known issue that we are actively fixing: [core] Critical Bug-fix: Fix ray memory metric calculation in CGroup V1 environment by WeichenXu123 · Pull Request #42508 · ray-project/ray · GitHub
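For readers hitting this in the meantime, the sketch below is a generic illustration of a cgroup v1 pitfall that produces exactly this kind of false positive (counting reclaimable page cache as used memory). It is not the actual Ray code and not necessarily the precise calculation fixed in the PR above:

```python
# Illustrative only: compare a naive cgroup v1 "used memory" reading with a
# working-set estimate that subtracts reclaimable page cache.
CGROUP_V1_MEM = "/sys/fs/cgroup/memory"


def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())


def read_stat(path: str) -> dict:
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats


usage = read_int(f"{CGROUP_V1_MEM}/memory.usage_in_bytes")
# Note: an unset v1 limit reads as a huge sentinel value (~9.2e18 bytes).
limit = read_int(f"{CGROUP_V1_MEM}/memory.limit_in_bytes")
stat = read_stat(f"{CGROUP_V1_MEM}/memory.stat")

naive_fraction = usage / limit  # includes page cache, can look inflated
working_set_fraction = (usage - stat["total_inactive_file"]) / limit

print(f"naive usage fraction:       {naive_fraction:.4f}")
print(f"working-set usage fraction: {working_set_fraction:.4f}")
```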