GPU memory not cleared after trial

I’m doing hyperparameter optimization of a PyTorch model using Ray Tune, and I’m having an issue similar to the one described here:

tensorflow - Out of memory at every second trial using Ray Tune - Stack Overflow

I attempted to add the wait_for_gpu function to my objective function, and according to the logs, GPU memory usage stays constant across all 20 retries, at which point the function raises an error.
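For context, the retry behavior described above (poll memory utilization, give up after N retries) can be sketched roughly like this. This is a simplified stand-in, not Ray's actual implementation; `get_util` is a hypothetical hook for whatever queries GPU memory (e.g. pynvml):

```python
import time

def wait_for_gpu_free(get_util, target_util=0.01, retry=20, delay_s=5):
    """Poll GPU memory utilization until it drops below target_util.

    get_util: callable returning current GPU memory utilization in [0, 1]
    (hypothetical hook standing in for an NVML/pynvml query).
    Raises RuntimeError after `retry` failed polls, mirroring the error
    seen when utilization stays constant.
    """
    util = None
    for attempt in range(retry):
        util = get_util()
        if util < target_util:
            return attempt  # GPU considered free on this poll
        time.sleep(delay_s)
    raise RuntimeError(
        f"GPU memory utilization stuck at {util:.2f} after {retry} retries"
    )
```

If the previous trial's process is still holding its CUDA context, `get_util` keeps returning the same value on every poll, which matches the "constant memory usage, then error" behavior in the logs.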

Is there a simple workaround here? Maybe something like the process described here:

GPU Support — Ray v1.6.0

in the section "Workers not Releasing GPU Resources", but for ray.tune?

Edit: Sleeping for 90s at the beginning of the objective function seems to have solved the issue, which makes me think there’s a problem with wait_for_gpu, because it was reporting constant GPU memory usage the whole time.

Could you share a repro script? Curious to learn why trials are holding on to GPU resources.

Apologies. Just saw this. Using a placement group factory fixed it.
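For anyone landing here later, a minimal sketch of that fix, assuming the Ray ~1.6 import path for PlacementGroupFactory; `objective` stands in for the trainable from the original post and the exact resource numbers are illustrative (untested resource-configuration sketch, not a definitive setup):

```python
from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory

# Reserve 1 CPU + 1 GPU per trial through a placement group, so the
# GPU reservation is tied to the trial and released when it finishes.
tune.run(
    objective,  # hypothetical trainable; not defined here
    resources_per_trial=PlacementGroupFactory([{"CPU": 1, "GPU": 1}]),
    num_samples=10,
)
```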

Nice! Thanks for reporting back.