Under-utilization of GPUs at end of experiment

I am using Tune for hyperparameter optimization with PyTorch Lightning (not Ray Lightning). I have 8 GPUs and assign 1 of them to each trial. As a result, at the end of the experiment, when fewer than 8 trials are left, some GPUs sit idle. I would like the idle resources to be reallocated to the remaining trials. Is there a way to automatically make sure all resources are in use at all times? It's worth noting that my Lightning trainer has accelerator="auto". Thank you!

Hey @Mike1, yes this is possible with the ResourceChangingScheduler API: Trial Schedulers (tune.schedulers) — Ray 2.2.0. This allows resources to be re-allocated during the run of the experiment and solves the exact issue that you are describing!

There is an example using this with XGBoost here: XGBoost Dynamic Resources Example — Ray 2.2.0, but the same idea can be applied to PyTorch Lightning.
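As a rough starting point, here is a minimal sketch of how the scheduler could be wired up for a Lightning trainable. It assumes Ray 2.2 and a function trainable that saves and loads checkpoints, since a trial is restarted when its resources change. The helper names `trainer_kwargs_from_resources` and `build_scheduler` are made up for illustration, and the choice of `ASHAScheduler` as the base scheduler is only an example.

```python
def trainer_kwargs_from_resources(required_resources):
    """Map a Tune resource dict such as {"CPU": 1, "GPU": 2} to Trainer kwargs.

    Hypothetical helper: inside the trainable, the dict could come from
    session.get_trial_resources().required_resources (ray.air.session in
    Ray 2.2), so the Trainer is rebuilt with the trial's current allocation.
    """
    # Fractional GPU requests (e.g. 0.5) round down to 0 and fall back to CPU here.
    num_gpus = int(required_resources.get("GPU", 0))
    if num_gpus > 0:
        return {"accelerator": "gpu", "devices": num_gpus}
    return {"accelerator": "cpu", "devices": 1}


def build_scheduler():
    """Wrap a base scheduler so free CPUs/GPUs are spread over running trials."""
    # Imported lazily so the helper above can be used without Ray installed.
    from ray.tune.schedulers import ASHAScheduler, ResourceChangingScheduler
    from ray.tune.schedulers.resource_changing_scheduler import DistributeResources

    return ResourceChangingScheduler(
        base_scheduler=ASHAScheduler(),
        # DistributeResources balances free resources evenly between trials.
        resources_allocation_function=DistributeResources(add_bundles=False),
    )


if __name__ == "__main__":
    # Example: a trial that has been bumped from 1 GPU to 2 GPUs.
    print(trainer_kwargs_from_resources({"CPU": 1, "GPU": 2}))
```

The scheduler would then be passed as `scheduler=build_scheduler()` in `tune.TuneConfig`, with each trial still starting from `{"cpu": 1, "gpu": 1}`. Whether Lightning picks up the extra GPUs cleanly after a restart is something to verify on your setup.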

This is still experimental though, so please let us know if you run into any issues!

Thank you for the answer. I tried it, and unfortunately this small modification tends to produce many errors (about 70% of the trials end in error, and the types of errors are inconsistent). I then tried a different solution: running the trials sequentially, giving all the resources to one trial at a time. However, I found that assigning more than 1 GPU to a trial (via resources_per_trial in tune.with_resources) makes the trial not run at all: Tune reports the trial as running, but nothing happens. Not a single iteration is run, and Tune appears to be stuck in an infinite loop. Any advice on how to make that work? Thank you!

Hey @Mike1, would you be able to share what your code looks like, as well as what is being printed out to stdout and the errors that you are seeing?
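If it helps when collecting that output, one thing worth logging at the top of the training function is which GPUs the trial process can actually see. Ray sets `CUDA_VISIBLE_DEVICES` for each trial based on the GPUs it was assigned. The sketch below only reads that variable; the function names are hypothetical and it is not a fix for the hang, just a way to gather information.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset means the process is not restricted to specific GPUs.
        return None
    # An empty string yields an empty list, i.e. no GPUs are visible.
    return [part.strip() for part in raw.split(",") if part.strip()]


def describe_trial_gpus(env=None):
    """Build a one-line summary suitable for printing inside a trial."""
    ids = visible_gpu_ids(env)
    if ids is None:
        return "CUDA_VISIBLE_DEVICES is unset: GPU visibility is not restricted"
    return f"CUDA_VISIBLE_DEVICES exposes {len(ids)} GPU(s): {ids}"


if __name__ == "__main__":
    print(describe_trial_gpus())
```

Calling `print(describe_trial_gpus())` as the first line of the trainable shows whether a trial that requested several GPUs really received them before Lightning starts.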