I am using Tune for hyperparameter optimization with PyTorch Lightning (not using Ray Lightning). I have 8 GPUs and I am assigning 1 of them to each trial. As a result, at the end of the experiment, when fewer than 8 trials remain, some GPUs sit idle. I would like the idle resources reallocated to the remaining trials. Is there a way to automatically make sure all resources are used at all times? It's worth noting that my Lightning trainer has accelerator="auto". Thank you!
Hey @Mike1, yes, this is possible with the ResourceChangingScheduler API: Trial Schedulers (tune.schedulers) — Ray 2.2.0. It allows resources to be re-allocated while the experiment is running and solves exactly the issue you are describing!
There is an example using this with XGBoost here: XGBoost Dynamic Resources Example — Ray 2.2.0, but the same idea can be applied to PyTorch Lightning.
This is still experimental though, so please let us know if you run into any issues!
Thank you for the answer. I tried it, and unfortunately this small modification tends to produce many errors (about 70% of the trials end in error, and the types of errors are inconsistent). I then tried a different approach: running the trials sequentially, giving all the resources to one trial at a time. However, I found that assigning more than 1 GPU to a trial (via resources_per_trial in tune.with_resources) makes the trial not run at all: Tune reports the trial as running, but nothing happens, not a single iteration is run, and Tune is stuck in an infinite loop. Any advice about how to make that work? Thank you!
Hey @Mike1, would you be able to share what your code looks like, as well as what is being printed to stdout and the errors that you are seeing?
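In the meantime, one thing worth checking: Ray restricts each trial to its assigned GPUs by setting CUDA_VISIBLE_DEVICES in the trial process, so rather than relying on accelerator="auto" to pick devices, you can pass an explicit device count derived from that variable. The helper below is a hypothetical sketch (the function name and the Trainer call in the comment are illustrative, not from Ray's or Lightning's API):

```python
import os

def visible_gpu_count(env=os.environ):
    """Count the GPUs Ray exposed to this trial via CUDA_VISIBLE_DEVICES."""
    ids = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([i for i in ids.split(",") if i.strip()])

# Inside the trainable, something like:
# trainer = pl.Trainer(accelerator="gpu", devices=visible_gpu_count(), ...)
```

If the trial still hangs with an explicit device count, the stdout logs would help narrow down whether Lightning's distributed strategy is waiting on worker processes that never start.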