Ray Tune worker heap memory not released after individual trial completed

mizhazha · April 12, 2023, 2:50am

High: It blocks me from completing my task.

I am using Ray Tune on a 12-GPU cluster with a 9-point search grid. Each trial uses 4 GPUs, so three trials run in parallel. Worker heap memory usage is pretty stable (70%) during the first batch of three trials; however, once the first batch is done, every trial in the second batch fails with OOM. It seems the memory is not released properly after an individual trial finishes.

Any suggestions? Thanks in advance!

saivivek15 · April 27, 2023, 10:01pm

I'm running into the same issue. I have the Ray Syncer enabled to sync logs and checkpoints from the worker to an HDFS directory for every trial. Memory usage on the workers keeps increasing and is not released when a trial finishes, so subsequent trials fail because there is not enough memory available on the worker.

Is there a configuration to explicitly release the memory once the trial is finished?

Any suggestions would be highly appreciated, thank you!

kai · May 1, 2023, 10:06am

Hi @saivivek15 and @mizhazha,

Can you share which setup you are using? For example, are you using BOHB or PBT for searching? There was a recent issue that leaked memory when using schedulers that pause trials. It has since been fixed on master, so you may want to try the latest nightly wheels to see if this solves your problem.
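
To help narrow down whether the growth comes from Python objects kept alive by the trial code itself rather than from the scheduler, it may be useful to log heap usage around each trial. Below is a minimal sketch using only the standard library. `run_trial` and `traced_trial` are hypothetical names for illustration and are not part of the Ray Tune API; the same measurement could be placed at the start and end of your own trainable. Note that `tracemalloc` only sees memory allocated through Python's allocator, so GPU memory and buffers allocated by native libraries will not show up here.

```python
import gc
import tracemalloc


def run_trial(config):
    """Hypothetical stand-in for a trainable: allocates memory, returns a score."""
    data = [bytearray(1024) for _ in range(config["n_blocks"])]  # ~1 KiB each
    return len(data)


def traced_trial(config):
    """Run one trial and report the Python heap bytes still held afterwards."""
    gc.collect()
    tracemalloc.reset_peak()  # Python 3.9+: measure the peak per trial
    before, _ = tracemalloc.get_traced_memory()
    score = run_trial(config)
    gc.collect()  # drop unreachable reference cycles before measuring
    after, peak = tracemalloc.get_traced_memory()
    return {"score": score, "retained": after - before, "peak": peak - before}


if __name__ == "__main__":
    tracemalloc.start()
    for n in (1000, 2000, 3000):
        print(traced_trial({"n_blocks": n}))
    tracemalloc.stop()
```

If `retained` keeps growing from trial to trial, something in the trial code (a global cache, a logger, a lingering reference to the model or dataset) is probably holding on to the memory. If it stays near zero while the worker's memory still climbs, the leak is more likely outside the Python heap or in the framework.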
