GPU memory not cleared after trial

I’m doing hyperparameter optimization of a PyTorch model using Ray Tune, and I’m having an issue similar to the one described here:

tensorflow - Out of memory at every second trial using Ray Tune - Stack Overflow

I attempted to add the wait_for_gpu function to my objective function, and according to the logs, GPU memory usage stays constant across all 20 retries, at which point the function raises an error.
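For context, the retry behavior described above (poll memory utilization, give up after N retries) can be sketched roughly like this. This is a simplified stand-in, not Ray's actual implementation; `get_util` is a hypothetical hook for whatever queries GPU memory (e.g. pynvml):

```python
import time

def wait_for_gpu_free(get_util, target_util=0.01, retry=20, delay_s=5):
    """Poll GPU memory utilization until it drops below target_util.

    get_util: callable returning current GPU memory utilization in [0, 1]
    (hypothetical hook standing in for an NVML/pynvml query).
    Raises RuntimeError after `retry` failed polls, mirroring the error
    seen when utilization stays constant.
    """
    util = None
    for attempt in range(retry):
        util = get_util()
        if util < target_util:
            return attempt  # GPU considered free on this poll
        time.sleep(delay_s)
    raise RuntimeError(
        f"GPU memory utilization stuck at {util:.2f} after {retry} retries"
    )
```

If the previous trial's process is still holding its CUDA context, `get_util` keeps returning the same value on every poll, which matches the "constant memory usage, then error" behavior in the logs.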

Is there a simple workaround here? Maybe something like the process described here:

GPU Support — Ray v1.6.0

in the section "Workers not Releasing GPU Resources", but for ray.tune?

Edit: Sleeping for 90s at the beginning of the objective function seems to have solved the issue, which makes me think there’s a problem with wait_for_gpu, because it was reporting constant GPU memory usage the whole time.

Could you share a repro script? Curious to learn why trials are holding on to GPU resources.

Apologies. Just saw this. Using a placement group factory fixed it.
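For anyone landing here later, a minimal sketch of that fix, assuming the Ray ~1.6 import path for PlacementGroupFactory; `objective` stands in for the trainable from the original post and the exact resource numbers are illustrative (untested resource-configuration sketch, not a definitive setup):

```python
from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory

# Reserve 1 CPU + 1 GPU per trial through a placement group, so the
# GPU reservation is tied to the trial and released when it finishes.
tune.run(
    objective,  # hypothetical trainable; not defined here
    resources_per_trial=PlacementGroupFactory([{"CPU": 1, "GPU": 1}]),
    num_samples=10,
)
```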

Nice! Thanks for reporting back.