How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
I got the following error while using Ray Tune to do hyperparameter tuning with 3 concurrent workers.
Memory on the node (IP: 192.168.2.110, ID: bd9319d3721cf0dcb7ad3a969a4d0667f755174de05699073397c448) where the task (actor ID: dbd6bbe3ab37410d9aaba2a001000000, name=ImplicitFunc.__init__, pid=5633, memory used=8.03GB) was running was 28.59GB / 68.26GB (0.418875), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 62224722ee47fa7b8c078e0aea7559ec6c79081bd7b8df6f4d3faa07) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.2.110`. To see the logs of the worker, use `ray logs worker-62224722ee47fa7b8c078e0aea7559ec6c79081bd7b8df6f4d3faa07*out -ip 192.168.2.110`. Top 10 memory users:
PID MEM(GB) COMMAND
5633 8.03 ray::ImplicitFunc.train
5453 6.56 ray::ImplicitFunc.train
5536 6.12 ray::ImplicitFunc.train
# ... irrelevant processes with low memory usage
The strangest thing is that there was plenty of free memory, yet Ray asserted that the memory usage (41.8875%) exceeded the 95% threshold and killed the worker.
It seems I can avoid this by reducing the number of parallel jobs, at the cost of a longer total running time. But that doesn't explain the behavior, and I want to know how to prevent this faulty kill.
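For reference, here is a minimal sketch (assuming Ray 2.x) of the two workarounds I'm considering: capping concurrency via `TuneConfig(max_concurrent_trials=...)`, and relaxing or disabling the memory monitor through the `RAY_memory_usage_threshold` / `RAY_memory_monitor_refresh_ms` environment variables, which as far as I understand have to be set before the Ray node processes start. `train_fn` and all the numeric values are hypothetical placeholders, not my actual setup:

```python
import os

# Workaround (b), hedged: the memory monitor is configured via env vars that
# the raylet reads at startup, so set them before `ray start` / `ray.init()`.
os.environ.setdefault("RAY_memory_usage_threshold", "0.98")     # raise the kill threshold
# os.environ.setdefault("RAY_memory_monitor_refresh_ms", "0")   # or disable the monitor entirely

import ray
from ray import tune

ray.init()

def train_fn(config):
    # hypothetical trainable; the real training loop goes here
    ...

tuner = tune.Tuner(
    # Workaround (a): reserve more CPUs per trial and cap concurrency so
    # fewer trials (and less memory) run at the same time.
    tune.with_resources(train_fn, {"cpu": 4}),
    tune_config=tune.TuneConfig(
        num_samples=30,             # placeholder
        max_concurrent_trials=2,    # was effectively 3 before
    ),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder search space
)
results = tuner.fit()
```

Neither of these explains why a reported usage of 0.418875 was treated as exceeding the 0.95 threshold, which is the part I would really like to understand.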