I am using Ray Tune on a 12-GPU cluster with a 9-point search grid. Each trial uses 4 GPUs, so three trials run in parallel. Worker heap memory usage is fairly stable (around 70%) during the first batch of three trials; however, once the first batch finishes, every trial in the second batch fails with OOM. It seems the memory is not released properly after an individual trial finishes.
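For reference, the scheduling arithmetic behind this setup can be sketched in plain Python. The helper name `trial_batches` is made up for illustration and is not part of Ray Tune, and the sketch assumes GPUs are the only resource limiting concurrency:

```python
import math


def trial_batches(total_gpus, gpus_per_trial, num_trials):
    """Return (concurrent_trials, num_batches) for a GPU-bound grid search.

    Hypothetical helper for illustration only; Ray Tune does its own
    resource accounting internally.
    """
    concurrent = total_gpus // gpus_per_trial
    if concurrent == 0:
        raise ValueError("a single trial needs more GPUs than the cluster has")
    # Trials run in waves of `concurrent`; the last wave may be partial.
    batches = math.ceil(num_trials / concurrent)
    return concurrent, batches


# The setup described above: 12 GPUs, 4 GPUs per trial, 9 grid points.
concurrent, batches = trial_batches(total_gpus=12, gpus_per_trial=4, num_trials=9)
print(concurrent, batches)
```

So the OOM shows up exactly at the boundary between the first and second of three waves, which is when workers that already hosted one trial are reused.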
I'm running into the same issue. I have the Ray Syncer enabled to sync logs and checkpoints for every trial from the worker to an HDFS directory. Memory usage on the workers keeps increasing and is not released when a trial finishes, so subsequent trials fail because there is not enough memory available on the worker.
Is there a configuration option to explicitly release the memory once a trial is finished?
Any suggestions would be highly appreciated, thank you!
Can you share which setup you are using? For example, are you using BOHB or PBT for searching? There was a recent issue that leaked memory when using schedulers that pause trials. It has since been fixed on master, so you may want to try the latest nightly wheels to see if that solves your problem.
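On the question of explicitly releasing memory when a trial finishes: the following is a minimal, generic Python sketch, not a Ray Tune configuration option. The names `run_trial_with_cleanup`, `train_fn`, and `BigBuffer` are hypothetical, and it assumes the memory is held by objects the training code itself owns. It would not help with a leak inside the library, such as the scheduler issue mentioned above.

```python
import gc


class BigBuffer:
    """Stand-in for a large per-trial object (model, dataset, cache)."""

    def __init__(self, size):
        self.data = bytearray(size)


def train_fn(config, state):
    # Hypothetical training function: keeps its large objects in `state`
    # so the caller can drop them in one place.
    state["buffer"] = BigBuffer(config["size"])
    return len(state["buffer"].data)


def run_trial_with_cleanup(train_fn, config, state):
    """Run one trial, then drop its objects even if training raised."""
    try:
        return train_fn(config, state)
    finally:
        state.clear()  # drop references held on behalf of the trial
        gc.collect()   # also collect any reference cycles left behind


state = {}
result = run_trial_with_cleanup(train_fn, {"size": 1024}, state)
print(result, state)
```

If the trial uses PyTorch on GPUs, calling `torch.cuda.empty_cache()` in the same `finally` block may also be worth trying, though whether it helps depends on where the memory is actually held.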