GPU memory not released

Hello. I’m trying to run parallel trials in Ray Tune, and if I specify too many (more than 8 at once), the GPU does not release memory fast enough for the other trials. I have 20 CPUs and 1 GPU. My setup looks like this:

import os

from ray import air, tune
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.optuna import OptunaSearch

num_parallel = 8

searcher = OptunaSearch(space=sample_params_ray, metric=eval_metric, mode="max")
algo = ConcurrencyLimiter(searcher, max_concurrent=num_parallel)

objective_func = tune.with_parameters(objective_ray, other_params=other_params)
objective_resources = tune.with_resources(
    objective_func, resources={"cpu": 1, "gpu": 1 / num_parallel}
)

tuner = tune.Tuner(
    objective_resources,
    tune_config=tune.TuneConfig(
        search_alg=algo,
        num_samples=num_trials,
        trial_dirname_creator=trial_str_creator,
        chdir_to_trial_dir=True,
    ),
    run_config=air.RunConfig(
        local_dir=f"{os.getcwd()}/results/xg_tree/",
        name=save_name,
        failure_config=air.FailureConfig(max_failures=0),
        verbose=1,
    ),
)

results = tuner.fit()

If I set the GPUtil threshold, it works for around 20 trials, but then the GPU memory keeps accumulating until the wait-for-GPU utility throws the memory-not-released error:

RuntimeError: GPU memory was not freed.

Thank you!

Hi @sjmitche9 , I notice that you are using fractional GPUs in your resource spec, which allows multiple trials to run on the same GPU. However, it is the user’s responsibility to make sure that the individual trials don’t use more than their share of the GPU memory. Can you check how much GPU memory one trial is using and make sure it does not exceed the fraction you set?

Reference: fractional-gpus
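One way to spot-check this from inside the objective is a small logging helper; the sketch below assumes the GPUtil package (mentioned later in this thread) and note that it reports device-wide usage, not per-trial usage:

import GPUtil

def log_gpu_memory(tag=""):
    # Print total memory in use on each visible GPU. With several trials
    # sharing one GPU this is an aggregate figure, so treat it as a rough check.
    for gpu in GPUtil.getGPUs():
        print(f"{tag} GPU {gpu.id}: {gpu.memoryUsed:.0f}/{gpu.memoryTotal:.0f} MiB used")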

Thank you @yunxuanx, that’s a good question. I’ll have to check, but sometimes the total GPU memory usage goes down, while other times it gets permanently stuck, and more often than not it gets stuck. If trials are in fact using more memory than they’re supposed to, do you have any recommendations?

Thank you!

Thanks @sjmitche9 . Generally, to reduce GPU memory usage, one can either reduce the model or data size, or clear the CUDA cache more frequently. :)
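For example, with a PyTorch-based objective (an assumption here; the equivalent calls depend on your framework), a trial could release memory at the end roughly like this:

import gc
import torch

def cleanup_after_trial(model, optimizer):
    # Drop Python references first, otherwise the allocations stay reachable,
    # then let PyTorch return its unused cached blocks to the CUDA driver.
    del model, optimizer
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()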

Thank you @yunxuanx! I’m actually having trouble clearing the memory cache; I tried a few approaches but wasn’t successful. Do you have any ideas on where to start? Also, if the GPU is using more memory than it’s supposed to, is there a fix for that? Is there a minimum fraction of the GPU that can be used per actor? And do you know whether the following statement is true or false: if I want 10 actors running in parallel, and they each get 10% of the memory, then memory shouldn’t accumulate because it would be cleared every time an actor starts a new trial?

Thanks again!

You may find this document helpful: Resources — Ray 2.4.0

There’s no fix on the Ray side if a trial uses more memory than the provided threshold. The resources you specified are “logical resources” and are only used for actor scheduling:

import ray

ray.init(num_gpus=3)

@ray.remote(num_gpus=0.5)
class FractionalGPUActor:
    def ping(self):
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))


fractional_gpu_actors = [FractionalGPUActor.remote() for _ in range(3)]
# Ray will try to pack GPUs if possible.
[ray.get(fractional_gpu_actors[i].ping.remote()) for i in range(3)]
# (FractionalGPUActor pid=57417) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57416) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57418) ray.get_gpu_ids(): [1]

In the above example, Ray will schedule two of the actors on the same GPU. However, it is the user’s responsibility to make sure each actor uses no more than 50% of the GPU memory. Ray has no minimum GPU fraction; it will pack as many actors onto a single GPU as the fraction allows.

As for the question:
If I want 10 actors running in parallel, and they each get 10% of the memory, then it shouldn’t accumulate because it would be cleared every time an actor starts a new trial?

you have to specify reuse_actors=True in TuneConfig in order to run a new trial with the old actor.
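For example, adding it to the TuneConfig from your original snippet (a minimal sketch):

tune_config = tune.TuneConfig(
    search_alg=algo,
    num_samples=num_trials,
    reuse_actors=True,  # run successive trials in the same actor instead of starting a new one
)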

cc @kai, maybe you can explain the behavior of reuse_actors in more detail? Thx

Thanks for the quick reply @yunxuanx! I will try that, because part of the problem is that the GPUtil-based wait for memory to be released sometimes gets stuck, since the memory isn’t releasing. I’ll try reusing the actors. Thanks!

Hey @sjmitche9 , I just found a utility function that might be helpful!

https://docs.ray.io/en/master/tune/api/doc/ray.tune.utils.wait_for_gpu.html#ray-tune-utils-wait-for-gpu
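It is typically called at the top of the trainable, before the model is allocated; a sketch (the target_util and retry values here are illustrative, not recommendations):

from ray.tune.utils import wait_for_gpu

def objective_ray(config, other_params=None):
    # Block until GPU memory utilization drops below target_util,
    # retrying up to `retry` times before raising RuntimeError.
    wait_for_gpu(target_util=0.1, retry=50, delay_s=5)
    ...  # build the model and train as usual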

Thanks again @yunxuanx. I’ve been using the wait_for_gpu utility, but the problem is that the GPU memory isn’t being released, so it waits forever. Actually, there is a threshold at which it eventually throws an error. I’ll try reusing the actors and hope that this clears out the memory after each trial. Thank you!


@yunxuanx reusing actors was the solution! Thank you for your help!


@sjmitche9 , could you please explain how you fixed the issue? I have the same problem and most of my trials fail.

Hi Ramin. When you create the tuner object, use reuse_actors=True in the TuneConfig. See below:

tuner = tune.Tuner(
    objective_resources,
    tune_config=tune.TuneConfig(
        search_alg=algo,
        num_samples=num_trials,
        trial_dirname_creator=trial_str_creator,
        chdir_to_trial_dir=True,
        reuse_actors=True,
    ),
)

Thank you for your help!
With reuse_actors=True, there are still some trials that fail with the following error:

Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
2023-11-13 02:30:13,621	ERROR tune_controller.py:911 -- Trial task failed for trial TensorflowTrainer_32706_00023
Traceback (most recent call last):