GPU memory not released

sjmitche9 · May 9, 2023, 7:44pm

Hello. I’m trying to run parallel trials in Ray Tune, and if I specify too many (more than 8 at once), the GPU does not release memory fast enough for other trials. I have 20 processors and 1 GPU. My setup looks like this:

import os

from ray import air, tune
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.optuna import OptunaSearch

num_parallel = 8

searcher = OptunaSearch(space=sample_params_ray, metric=eval_metric, mode="max")
algo = ConcurrencyLimiter(searcher, max_concurrent=num_parallel)

objective_func = tune.with_parameters(objective_ray, other_params=other_params)
# Each trial requests 1 CPU and 1/8 of the single GPU (a logical share).
objective_resources = tune.with_resources(objective_func, resources={"cpu": 1, "gpu": 1/num_parallel})

tuner = tune.Tuner(
    objective_resources,
    tune_config=tune.TuneConfig(
        search_alg=algo,
        num_samples=num_trials,
        trial_dirname_creator=trial_str_creator,
        chdir_to_trial_dir=True,
    ),
    run_config=air.RunConfig(
        local_dir=f"{os.getcwd()}/results/xg_tree/",
        name=save_name,
        failure_config=air.FailureConfig(max_failures=0),
        verbose=1,
    ),
)

results = tuner.fit()

If I set the GPUtil threshold, it works for around 20 trials, and then the GPU memory keeps accumulating until the wait_for_gpu utility throws the memory-not-released error:

RuntimeError: GPU memory was not freed.

Thank you!

yunxuanx · May 13, 2023, 1:18am

Hi @sjmitche9, I notice that you are using a fractional GPU in your resource spec, which allows multiple trials to run on the same GPU. However, it is the user’s responsibility to make sure that the individual tasks don’t use more than their share of the GPU memory. Can you check how much GPU memory one trial is using and make sure it does not exceed the fraction you set?
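
One possible way to run that check is sketched below. It assumes `nvidia-smi` is on the PATH and reads per-process memory from its `--query-compute-apps` CSV output. The helper names (`query_compute_apps`, `parse_compute_apps`, `over_budget`), the sample PIDs, and the 16384 MiB total are hypothetical and only for illustration.

```python
import subprocess


def query_compute_apps():
    """Return per-process GPU memory as CSV text (needs nvidia-smi on PATH)."""
    return subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_compute_apps(csv_text):
    """Map pid -> used GPU memory in MiB from 'pid, used_memory' lines."""
    usage = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip rows where the driver does not report a number.
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        usage[int(fields[0])] = int(fields[1])
    return usage


def over_budget(usage, total_mib, fraction):
    """Return the processes whose usage exceeds their logical share."""
    budget = total_mib * fraction
    return {pid: mib for pid, mib in usage.items() if mib > budget}


# Made-up sample output; on a real machine pass query_compute_apps() instead.
sample = "57416, 1800\n57417, 2600\n"
print(over_budget(parse_compute_apps(sample), total_mib=16384, fraction=1 / 8))
```

With a 1/8 share of 16384 MiB the budget is 2048 MiB per trial, so only the second sample process is flagged.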

Reference: fractional-gpus

sjmitche9 · May 13, 2023, 1:58am

Thank you @yunxuanx, that’s a good question. I’ll have to check. Sometimes the total GPU memory usage goes down, but more often than not it gets permanently stuck. If trials are in fact using more memory than they’re supposed to, do you have any recommendations?

Thank you!

yunxuanx · May 15, 2023, 8:30pm

Thanks @sjmitche9. Generally, to reduce GPU memory usage, you can reduce the model size or data size, or clear the CUDA cache more frequently :)
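
As a starting point for clearing the cache, here is a minimal sketch that assumes the trial uses PyTorch. The thread does not say which framework is involved, and other libraries have their own mechanisms. The function name `release_gpu_memory` is hypothetical. `torch.cuda.empty_cache()` only returns cached blocks that no live tensor still references, so references need to be dropped first.

```python
import gc


def release_gpu_memory():
    """Best-effort cleanup at the end of a trial (assumes PyTorch).

    Returns True if the CUDA cache was cleared, False if PyTorch or CUDA
    is not available in this process.
    """
    # Collect unreachable Python objects so their tensors can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Hand cached, unused blocks back to the CUDA driver.
    torch.cuda.empty_cache()
    return True


print(release_gpu_memory())
```

A typical place to call it would be the last line of the trial function, after deleting the model and any tensors the trial created.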

sjmitche9 · May 15, 2023, 9:01pm

Thank you @yunxuanx! I’m actually having trouble clearing the memory cache. I tried a few ways but wasn’t successful. Do you have any ideas on where to start there? Also, if a trial is using more GPU memory than it’s supposed to, is there a fix for that? Is there a minimum fraction of the GPU that can be used per actor? Do you know whether the following statement is true or false? If I want 10 actors running in parallel, and they each get 10% of the memory, then it shouldn’t accumulate because it would be cleared every time an actor starts a new trial?

Thanks again!

yunxuanx · May 15, 2023, 11:39pm

You may find this document helpful: Resources — Ray 2.4.0

There’s no fix on the Ray side if a trial uses more memory than its specified share. The resource you specified is a “logical resource” and is only used for actor scheduling:

import ray

ray.init(num_gpus=3)

@ray.remote(num_gpus=0.5)
class FractionalGPUActor:
    def ping(self):
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))


fractional_gpu_actors = [FractionalGPUActor.remote() for _ in range(3)]
# Ray will try to pack GPUs if possible.
[ray.get(fractional_gpu_actors[i].ping.remote()) for i in range(3)]
# (FractionalGPUActor pid=57417) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57416) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57418) ray.get_gpu_ids(): [1]

In the above example, Ray will schedule two actors on the same GPU. However, it’s the user’s responsibility to make sure each actor uses no more than 50% of the GPU memory. Ray has no minimum GPU fraction; it will place as many actors as possible on a single GPU according to this fraction.
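
To make the “logical resource” idea concrete, here is a toy first-fit model of that bookkeeping. It is not Ray’s scheduler, and the function name `pack_actors` is hypothetical. The point it illustrates is that placement depends only on the requested fractions, never on how much memory an actor actually uses.

```python
from fractions import Fraction


def pack_actors(num_gpus, requests):
    """Place each fractional request on the first GPU with enough
    logical capacity left; None means it cannot be placed yet."""
    free = [Fraction(1)] * num_gpus
    placement = []
    for request in requests:
        # Exact arithmetic avoids float rounding for shares like 0.1.
        share = Fraction(request).limit_denominator(1000)
        for gpu_id in range(num_gpus):
            if free[gpu_id] >= share:
                free[gpu_id] -= share
                placement.append(gpu_id)
                break
        else:
            placement.append(None)
    return placement


print(pack_actors(3, [0.5, 0.5, 0.5]))  # [0, 0, 1]
print(pack_actors(1, [1 / 8] * 9))      # eight land on GPU 0, the ninth gets None
```

Nothing in this accounting looks at real memory, so eight trials that each use more than one eighth of the GPU would still all be admitted.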

As for the question:
If I want 10 actors running in parallel, and they each get 10% of the memory, then it shouldn’t accumulate because it would be cleared every time an actor starts a new trial?

You have to specify reuse_actors=True in TuneConfig in order to run a new trial with the old actor.

yunxuanx · May 15, 2023, 11:46pm

cc @kai, who may be able to better explain the behavior of reuse_actors. Thanks!

sjmitche9 · May 16, 2023, 3:34am

Thanks for the quick reply @yunxuanx! I will try that, because part of the problem is that using the GPUtil method to wait for the memory to be released sometimes gets stuck because the memory isn’t releasing. I’ll try reusing the actors. Thanks!

yunxuanx · May 17, 2023, 12:09am

Hey @sjmitche9, I just found a utility function that might be helpful!

https://docs.ray.io/en/master/tune/api/doc/ray.tune.utils.wait_for_gpu.html#ray-tune-utils-wait-for-gpu

sjmitche9 · May 17, 2023, 1:32am

Thanks again @yunxuanx. I’ve been using the wait_for_gpu utility, but the problem is that the GPU memory isn’t being released, so it keeps waiting. Actually, there is a threshold after which it eventually throws an error. I’ll try reusing the actors and hope that this clears out the memory after each trial. Thank you!
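
The wait-then-give-up behavior described here can be sketched generically as follows. This is a simplified stand-in, not Ray’s implementation: the function name `wait_until_free`, its parameters, and their defaults are assumptions for illustration, and the memory reading is injected as a callable so the example runs without a GPU. The error text is the one quoted earlier in the thread.

```python
import time


def wait_until_free(read_memory_fraction, target=0.01, retry=20, delay_s=5.0):
    """Poll read_memory_fraction() until it is at or below target.

    Raises RuntimeError after `retry` unsuccessful checks.
    """
    for _ in range(retry):
        if read_memory_fraction() <= target:
            return True
        time.sleep(delay_s)
    raise RuntimeError("GPU memory was not freed.")


# Memory drains after a few checks: the wait succeeds.
readings = iter([0.6, 0.3, 0.0])
print(wait_until_free(lambda: next(readings), delay_s=0))

# Memory never drains: the wait gives up after three checks.
try:
    wait_until_free(lambda: 0.5, retry=3, delay_s=0)
except RuntimeError as err:
    print(err)
```

If a finished trial’s process keeps holding its memory, the second case is what happens: waiting longer only delays the error.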

sjmitche9 · May 25, 2023, 2:27am

@yunxuanx reusing actors was the solution! Thank you for your help!

Ramin_Nateghi · November 10, 2023, 5:17pm

@sjmitche9, could you please explain how you fixed the issue? I have the same problem, and most of my trials fail.

sjmitche9 · November 10, 2023, 9:15pm

Hi Ramin. When you create the tuner object, use reuse_actors=True in the TuneConfig. See below:

tuner = tune.Tuner(
    objective_resources,
    tune_config=tune.TuneConfig(
        search_alg=algo,
        num_samples=num_trials,
        trial_dirname_creator=trial_str_creator,
        chdir_to_trial_dir=True,
        reuse_actors=True,
    ),
)

Ramin_Nateghi · November 13, 2023, 2:33am

Thank you for your help!
Even with reuse_actors=True, some trials still fail with the following error:

Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
2023-11-13 02:30:13,621	ERROR tune_controller.py:911 -- Trial task failed for trial TensorflowTrainer_32706_00023
Traceback (most recent call last):
