Distributed asynchronous hyperparameter tuning on queued resources

How severely does this issue affect your experience of using Ray?

  • Medium: It makes it significantly more difficult to complete my task, but I can work around it.

In cluster environments running queue-based workload managers such as SLURM, it is impractical to reserve blocks of resources that all have to start at the same time and/or remain allocated for a long time.

Luckily, there are several algorithms that either can be slightly modified to work better on such a system or don't need any modification in the first place. Random sampling is the most obvious one that doesn't need any extra work, or even an extra package such as Ray. However, for others, such as Bayesian Optimization and Population Based Training, it would be helpful if this were available through Ray Tune.
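To illustrate the coordination-free case, here is a minimal sketch of what I mean by random sampling, where each SLURM array task derives its own hyperparameters from `SLURM_ARRAY_TASK_ID` (the search space and ranges are made up for illustration):

```python
import os
import random

def sample_config(seed):
    # Hypothetical search space; the ranges are made up for illustration.
    rng = random.Random(seed)
    return {
        "lr": 10 ** rng.uniform(-5, -2),
        "batch_size": rng.choice([32, 64, 128]),
    }

# Each array task is fully independent: the seed comes from SLURM,
# so no coordination or master process is needed.
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
config = sample_config(task_id)
print(f"Task {task_id} training with {config}")
# ... train one model with `config` and write its metrics to shared storage ...
```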

I haven’t worked with Ray previously and am spending some time looking around for existing implementations before deciding that I’ll have to implement everything myself from scratch.

To clarify what I’m looking for: I basically want any method for which you could start N jobs where each job trains only one model. In SLURM this could look something like:

```
sbatch --array=1-20%4 jobscript.sh
```

which would yield 20 training trajectories in total, where at most 4 are running at the same time, but possibly fewer depending on availability and queueing times.

Is anything like this possible with Ray? (A simple no is also an acceptable answer; then I can at least focus on implementing it myself instead.)

Hi @VRehnberg,

If you’re using Ray Tune, you can set the max_concurrent_trials attribute in the TuneConfig to 4 to limit the maximum number of parallel trials to 4. Ray Tune can be used with Bayesian Optimization and Population Based Training.
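For example, something like this with the Tuner API (the trainable and search space here are just stand-ins):

```python
from ray import tune

def trainable(config):
    # Stand-in for real training; returns the final metric as a dict.
    score = -((config["lr"] - 1e-3) ** 2)
    return {"score": score}

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-5, 1e-2)},
    tune_config=tune.TuneConfig(
        num_samples=20,           # 20 trials in total
        max_concurrent_trials=4,  # at most 4 running at once
    ),
)
results = tuner.fit()
```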

Let me know if this answers your question, or points you in the right direction!

Hi @matthewdeng,

No, the key part of that example is that the trials will not necessarily start at the same time and will not communicate with each other. I’m looking for a way to run the separate trials without any communication between themselves, or even with a master process.

The examples I’ve seen so far seem to require that you allocate a fixed amount of computational resources for the entire duration of the parameter search. For larger compute clusters that run some kind of job scheduler this is impractical, as the scheduler will have a harder time allocating one big job than the same job divided into 20 separate pieces. Thus, waiting times will be longer for the user and the resources will spend more time idling, which is wasteful.

For Population Based Training, writing something that works without (direct) communication should be reasonably simple, using the fact that these types of systems usually have storage shared between compute nodes. It is then sufficient to append, for each epoch, the metric the hyperparameter optimization is based on to some file, as well as saving the hyperparameters and best-performing models to file.
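A minimal sketch of what I have in mind, assuming a shared filesystem; the file format, the path, and the exploit/explore rule are all made up for illustration:

```python
import csv
import os
import random

RESULTS_FILE = "/shared/pbt_results.csv"  # hypothetical shared path

def log_result(trial_id, epoch, metric, config):
    # Append this trial's result for the current epoch.
    new_file = not os.path.exists(RESULTS_FILE)
    with open(RESULTS_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["trial_id", "epoch", "metric", "lr"])
        writer.writerow([trial_id, epoch, metric, config["lr"]])

def maybe_exploit(trial_id, epoch, config):
    # Read everyone's results for this epoch and, if we are in the
    # bottom half, copy a better trial's hyperparameters with a perturbation.
    with open(RESULTS_FILE) as f:
        rows = [r for r in csv.DictReader(f) if int(r["epoch"]) == epoch]
    rows.sort(key=lambda r: float(r["metric"]), reverse=True)
    bottom_half = [r["trial_id"] for r in rows[len(rows) // 2:]]
    if str(trial_id) in bottom_half:
        best = rows[0]
        config["lr"] = float(best["lr"]) * random.choice([0.8, 1.2])
        # ... load the best trial's checkpoint from shared storage here ...
    return config
```

Each trial would call log_result and then maybe_exploit once per epoch. Concurrent appends to the same file could interleave in principle, so a real version would want file locking or one file per trial.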

For Bayesian Optimization it will depend on how efficient the sampling and updating are, because the simplest way I’ve thought of so far would be to train a Gaussian Process from scratch for each new trial. I suppose you could have a separate process running that continuously generates samples for each new entry coming in, but then everything would depend on that process not failing, which should be avoided if possible.
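To illustrate the refit-from-scratch idea, here is a sketch using scikit-learn's GaussianProcessRegressor with a crude upper-confidence-bound acquisition over random candidates (the results file is the same hypothetical CSV as above, with the learning rate searched in log space):

```python
import csv

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_next_lr(results_file, n_candidates=1000):
    # Refit a GP from scratch on everything logged so far.
    with open(results_file) as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 10 ** np.random.uniform(-5, -2)  # no data yet: sample randomly
    X = np.array([[np.log10(float(r["lr"]))] for r in rows])
    y = np.array([float(r["metric"]) for r in rows])
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    # Crude acquisition: upper confidence bound over random candidates.
    candidates = np.random.uniform(-5, -2, size=(n_candidates, 1))
    mean, std = gp.predict(candidates, return_std=True)
    best = candidates[np.argmax(mean + 1.96 * std)]
    return 10 ** best[0]
```

Each new SLURM job would call this once at startup; refitting on tens of points is cheap, so no long-lived master process is needed.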