GPU memory not being freed every other trial in Ray Tune

Hi, I’m trying to use Ray Tune and I’m running into an issue where GPU memory is not freed exactly every other trial. Originally I was getting out-of-memory errors even though my training easily fits in memory when not using Ray, so after seeing posts like this I started using tune.utils.wait_for_gpu().

It did not help…but it did make the problem more consistent, I guess. Exactly every other trial I get a “RuntimeError: GPU memory was not freed” from wait_for_gpu(). To avoid this I have added manual PyTorch cache clearing, garbage collection, some sleep time before the wait_for_gpu() call, and a higher number of retries in it, and absolutely nothing has helped. My code looks something like:

import gc
import time
from functools import partial

import torch
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB

def objective(params, config):
    # Try to make sure the previous trial's GPU memory is actually released
    torch.cuda.empty_cache()
    gc.collect()
    time.sleep(90)
    tune.utils.wait_for_gpu(retry=50)

    # Actual training/evaluation code after this

def run_hyper_param_opt(search_space, base_config):
    algo = TuneBOHB(metric="avg_score", mode="max")
    bohb = HyperBandForBOHB(
        time_attr="training_iteration",
        metric="avg_score",
        mode="max",
        max_t=100,
    )

    # Each trial requests one full GPU
    trainable_with_gpu = tune.with_resources(partial(objective, config=base_config), {"gpu": 1})
    tuner = tune.Tuner(
        trainable_with_gpu,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            search_alg=algo,
            scheduler=bohb,
            num_samples=100,
        ),
    )
    results = tuner.fit()
    return results

But nothing I do has helped. Exactly every other trial it claims the GPU memory was not freed, yet the following trial works normally.

Can you please offer some advice as to what to do about this?

How many trials are you running, and how many GPUs do you have?

It’s just running on one GPU and the problem starts on the 2nd trial. It’s very consistent. As I said, it’s exactly every other trial: the 1st trial works fine, the 2nd trial gives the GPU memory not freed error, the 3rd trial works fine, the 4th gives the error, the 5th works fine, and so on.

Hi,
Can you try setting reuse_actors=False?
What is the training library you are using?
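
For example, a sketch based on the Tuner setup you posted, with everything else unchanged:

tuner = tune.Tuner(
    trainable_with_gpu,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        search_alg=algo,
        scheduler=bohb,
        num_samples=100,
        # Start a fresh actor process for every trial instead of reusing the
        # previous one, so its GPU memory is released when the process exits
        reuse_actors=False,
    ),
)
results = tuner.fit()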