I’m doing hyperparameter optimization of a PyTorch model using ray.tune, and I’m having an issue similar to the one described here:
I attempted to add a call to the `wait_for_gpu` function, but according to the logs, GPU memory usage stays constant across all 20 retries, at which point the function raises an error.
Is there a simple workaround here? Maybe something like the process described here:
in the section “Workers not Releasing GPU Resources”, but for ray.tune?
Edit: Sleeping for 90s at the beginning of the objective function seems to have solved the issue, which makes me think there’s a problem with `wait_for_gpu`, since it was reporting constant GPU memory usage the whole time.
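For reference, the workaround is roughly the following: a plain wrapper that sleeps before the real objective runs, giving the previous trial’s worker time to release its GPU memory. This is only a sketch — the `make_delayed_objective` helper and `delay_s` parameter are my own names, not part of Ray’s API — and the wrapped function is what would get passed to Tune as the trainable.

```python
import time


def make_delayed_objective(objective, delay_s=90):
    """Wrap a Tune objective so each trial waits before starting.

    Used here as a stand-in for wait_for_gpu, which kept reporting
    constant GPU memory usage and erroring out after its retries.
    """
    def delayed_objective(config):
        time.sleep(delay_s)  # let the prior trial's GPU memory be freed
        return objective(config)
    return delayed_objective


# Dummy objective for illustration; in practice this would build and
# train the PyTorch model and report metrics back to Tune.
def objective(config):
    return {"loss": config["x"] ** 2}


trainable = make_delayed_objective(objective, delay_s=90)
```

The obvious downside is that the 90s delay is paid once per trial even when the GPU is already free, which is exactly what `wait_for_gpu` is supposed to avoid.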