Hi, I’m trying to use Ray Tune and I’m running into an issue where GPU memory is not freed on exactly every other trial. Originally I was getting out-of-memory errors even though my training easily fits in memory when not using Ray, so after seeing posts like this I started using tune.utils.wait_for_gpu().
It did not help, but it did make the problem more consistent, I guess. On exactly every other trial I get a “RuntimeError: GPU memory was not freed” from wait_for_gpu(). To avoid this I have added manual PyTorch cache clearing, garbage collection, some sleep time before actually calling wait_for_gpu, and an increased number of retries, and absolutely nothing has helped. My code looks something like this:
import gc
import time
from functools import partial

import torch
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB


def objective(params, config):
    # Try to make sure the previous trial's GPU memory has actually been released
    torch.cuda.empty_cache()
    gc.collect()
    time.sleep(90)
    tune.utils.wait_for_gpu(retry=50)
    # Actual training/evaluation code after this


def run_hyper_param_opt(search_space, base_config):
    algo = TuneBOHB(metric="avg_score", mode="max")
    bohb = HyperBandForBOHB(
        time_attr="training_iteration",
        metric="avg_score",
        mode="max",
        max_t=100,
    )
    trainable_with_gpu = tune.with_resources(
        partial(objective, config=base_config), {"gpu": 1}
    )
    tuner = tune.Tuner(
        trainable_with_gpu,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            search_alg=algo,
            scheduler=bohb,
            num_samples=100,
        ),
    )
    results = tuner.fit()
    return results
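For completeness, I launch it roughly like this (the search space and base config below are just placeholders with made-up hyperparameter names, not my real ones):

search_space = {
    "lr": tune.loguniform(1e-5, 1e-2),       # placeholder hyperparameters,
    "batch_size": tune.choice([16, 32, 64]),  # not my real search space
}
base_config = {"epochs": 10}  # likewise a stand-in for my real base config

results = run_hyper_param_opt(search_space, base_config)
best = results.get_best_result(metric="avg_score", mode="max")
print(best.config)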
Nothing I do has helped. Exactly every other trial it claims the GPU memory was not freed, but the following trial works normally.
Can you please offer some advice on what to do about this?