Under-utilization of GPUs at the end of an experiment

I am using Tune for hyperparameter optimization with PyTorch Lightning (not using Ray Lightning). I have 8 GPUs and I am assigning 1 of them to each trial. As a result, at the end of the experiment, when fewer than 8 trials are left, some GPUs sit idle. I would like the idle resources to be reallocated to the remaining trials. Is there a way to automatically make sure all resources are used at all times? It's worth noting that my Lightning trainer uses accelerator="auto". Thank you!
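For reference, a rough sketch of my current setup (placeholder names, details trimmed):

```python
from ray import tune

def train_fn(config):
    import pytorch_lightning as pl
    # ... build my LightningModule / DataModule from `config` ...
    trainer = pl.Trainer(accelerator="auto", max_epochs=10)
    # trainer.fit(model, datamodule=datamodule)
    # (metrics are reported back to Tune from a Lightning callback)

tuner = tune.Tuner(
    # 1 of my 8 GPUs per trial, so at most 8 trials run concurrently
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 1}),
    tune_config=tune.TuneConfig(metric="val_loss", mode="min", num_samples=32),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
tuner.fit()
```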

Hey @Mike1, yes, this is possible with the ResourceChangingScheduler API: Trial Schedulers (tune.schedulers) — Ray 2.2.0. It allows resources to be reallocated while the experiment is running, which solves exactly the issue you are describing!

There is an example using this with XGBoost here: XGBoost Dynamic Resources Example — Ray 2.2.0, but the same idea can be applied to PyTorch Lightning.
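In case it helps, here is a rough, untested sketch of how the pieces could be wired together for a Lightning trainable. The function name, metric name, and search space are placeholders; as in the XGBoost example, the trainable should read its current allocation and size the Lightning Trainer accordingly:

```python
from ray import tune
from ray.air import session
from ray.tune.schedulers import ASHAScheduler, ResourceChangingScheduler
from ray.tune.schedulers.resource_changing_scheduler import DistributeResources

def train_fn(config):  # placeholder for your Lightning training function
    # GPUs currently allocated to this trial; this can grow as other trials
    # finish and the scheduler hands out the idle GPUs.
    n_gpus = int(session.get_trial_resources().required_resources.get("GPU", 1))
    # trainer = pl.Trainer(accelerator="gpu", devices=n_gpus, ...)
    # trainer.fit(...)
    session.report({"val_loss": 0.0})  # report your real validation metric

scheduler = ResourceChangingScheduler(
    base_scheduler=ASHAScheduler(),                        # or None / another scheduler
    resources_allocation_function=DistributeResources(),   # spread free resources evenly
)

tuner = tune.Tuner(
    # Each trial starts with 1 GPU; the scheduler may increase this later.
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 1}),
    tune_config=tune.TuneConfig(
        metric="val_loss", mode="min", num_samples=16, scheduler=scheduler
    ),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
tuner.fit()
```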

This is still experimental though, so please let us know if you run into any issues!

Thank you for the answer. I tried it, and unfortunately this small modification tends to produce many errors: about 70% of the trials end in error, and the types of errors are inconsistent. I then tried a different solution: running the trials sequentially, giving all the resources to one trial at a time. However, I found that assigning more than 1 GPU to a trial (via the resources passed to tune.with_resources) makes the trial not run at all. Tune reports the trial as RUNNING, but nothing happens: not a single iteration is run, and Tune appears to hang indefinitely. Any advice on how to make that work? Thank you!
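Roughly, the second attempt looks like the sketch below (placeholder names, simplified):

```python
from ray import tune

def train_fn(config):
    # same Lightning training function as in my first post;
    # the Trainer still uses accelerator="auto"
    ...

tuner = tune.Tuner(
    # all 8 GPUs to a single trial, so trials run one at a time
    tune.with_resources(train_fn, {"cpu": 16, "gpu": 8}),
    tune_config=tune.TuneConfig(metric="val_loss", mode="min", num_samples=4),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
tuner.fit()  # the trial shows up as RUNNING but never completes an iteration
```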

Hey @Mike1, would you be able to share what your code looks like, as well as what is being printed out to stdout and the errors that you are seeing?