GPU memory not being freed every other trial in Ray Tune

Hi, I’m trying to use Ray Tune and I’m running into an issue where GPU memory is not freed exactly every other trial. Originally I was getting out-of-memory errors even though my training easily fits in memory when not using Ray, so after seeing posts like this I started using tune.utils.wait_for_gpu().

It did not help…but it did make the problem more consistent, I guess. Exactly every other trial I get a “RuntimeError: GPU memory was not freed” from wait_for_gpu(). To avoid this I have added manual PyTorch cache clearing, garbage collection, some sleep time before the wait_for_gpu() call, and a higher number of retries in it, and absolutely nothing has helped. My code looks something like:

import gc
import time
from functools import partial

import torch
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB

def objective(params, config):
    # Try to make sure the previous trial's GPU memory is actually released
    torch.cuda.empty_cache()
    gc.collect()
    time.sleep(90)
    tune.utils.wait_for_gpu(retry=50)

    # Actual training/evaluation code after this

def run_hyper_param_opt(search_space, base_config):
    algo = TuneBOHB(metric="avg_score", mode="max")
    bohb = HyperBandForBOHB(
        time_attr="training_iteration",
        metric="avg_score",
        mode="max",
        max_t=100,
    )

    # Each trial requests one full GPU
    trainable_with_gpu = tune.with_resources(partial(objective, config=base_config), {"gpu": 1})
    tuner = tune.Tuner(
        trainable_with_gpu,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            search_alg=algo,
            scheduler=bohb,
            num_samples=100,
        ),
    )
    results = tuner.fit()
    return results

But nothing I do has helped. Exactly every other trial it claims the GPU memory was not freed, yet the following trial works normally.

Can you please offer some advice as to what to do about this?

How many trials are you running, and how many GPUs do you have?

It’s just running on one GPU and the problem starts on the 2nd trial. It’s very consistent. As I said, it’s exactly every other trial: the 1st trial works fine, the 2nd trial gives the GPU memory not freed error, the 3rd trial works fine, the 4th gives the error, the 5th works fine, and so on.

Hi,
Can you try setting reuse_actors=False?
What is the training library you are using?
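
For example, a sketch based on the Tuner setup you posted, with everything else unchanged:

tuner = tune.Tuner(
    trainable_with_gpu,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        search_alg=algo,
        scheduler=bohb,
        num_samples=100,
        # Start a fresh actor process for every trial instead of reusing the
        # previous one, so its GPU memory is released when the process exits
        reuse_actors=False,
    ),
)
results = tuner.fit()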