I’m doing hyperparameter optimization of a PyTorch model using ray.tune, and I’m running into an issue similar to the one described here:
tensorflow - Out of memory at every second trial using Ray Tune - Stack Overflow
I attempted to add the wait_for_gpu function, and according to the logs, GPU memory usage stays constant through all 20 retries, at which point the function raises an error.
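For context, my objective function looks roughly like this. I’m on an older Ray version, so this uses the tune.report / tune.run API, and the real model is swapped out for a tiny stand-in (the config values are just examples):

```python
import torch
import torch.nn as nn
from ray import tune
from ray.tune.utils import wait_for_gpu


def objective(config):
    # With the defaults, this re-checks GPU memory utilization several
    # times (20 retries in my logs) and raises if it never frees up.
    wait_for_gpu()

    # Stand-in for my real model/training code.
    model = nn.Linear(10, 1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(10):
        loss = model(torch.randn(32, 10, device="cuda")).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        tune.report(loss=loss.item())


analysis = tune.run(
    objective,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial={"gpu": 1},
    num_samples=8,
)
```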
Is there a simple workaround here? Maybe something like the process described in the "Workers not Releasing GPU Resources" section of the Ray docs, but for ray.tune?
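For reference, as far as I can tell that section’s trick applies to plain Ray tasks and looks like the sketch below (the task body here is just a dummy); I’m not sure what the equivalent knob would be for a Tune trainable:

```python
import ray

ray.init(num_gpus=1)


# The docs' workaround for plain tasks: max_calls=1 tears the worker
# process down after every call, so its GPU memory is actually released.
@ray.remote(num_gpus=1, max_calls=1)
def use_gpu_task():
    import torch
    return torch.cuda.memory_allocated()
```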
Edit: Sleeping for 90s at the beginning of the objective function seems to have solved the issue, which makes me think there’s a problem with wait_for_gpu, because it was reporting constant GPU memory usage the whole time.
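Concretely, the workaround is just this at the top of the trainable (90 s was an arbitrary guess):

```python
import time


def objective(config):
    # Crude workaround: give the previous trial's worker process time to
    # actually release its CUDA memory before this trial starts allocating.
    time.sleep(90)
    # ... rest of the training code unchanged ...
```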