TL;DR: I want to add N trials to my tuner (all of which have to be in the PENDING/PAUSED state) and execute M of them in parallel until I pause them again.
I work with a setup similar to population-based training, where I fork the currently best trials. I want to train N trials evenly, M of them in parallel, and perturb them every s steps.
I wonder how I can define M. If I use TuneConfig.max_concurrent_trials, then M trials are trained until the final step, which is not what I want.
With ray.init(num_cpus=M*cpu_per_trial) I can control it to some extent, but then I need to figure out cpu_per_trial, which is also possible but adds more code to my entry point.
On top of that, I have an extra process that I might need to start, so it is actually M*cpu_per_trial+1; but if cpu_per_trial=1, this extra CPU ends up being consumed by a trial instead of my other process.
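For context, this is roughly what my entry point does today (the numbers and `my_trainable` are placeholders):

```python
import ray
from ray import tune

M = 4              # trials to run in parallel
cpu_per_trial = 2  # CPUs each trial needs

# +1 CPU reserved for the extra process mentioned above.
ray.init(num_cpus=M * cpu_per_trial + 1)

# my_trainable is a placeholder for the actual training function.
trainable = tune.with_resources(my_trainable, {"cpu": cpu_per_trial})
```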
Can anyone help me out with an idea for how to set this up, ideally without needing to touch num_cpus?
It seems likely that the most direct way to control the number of parallel Ray Tune trials (M) is to set TuneConfig(max_concurrent_trials=M), which limits the number of trials running at once regardless of how many are in PENDING/PAUSED states. However, as you noted, this keeps M trials running until completion and does not give you fine-grained control over pausing and resuming trials mid-training. Adjusting ray.init(num_cpus=...) together with cpu_per_trial is another approach, but as you mentioned, it requires careful resource accounting and becomes cumbersome once extra processes are involved. According to the Ray Tune resources guide, these are the main supported mechanisms.
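For reference, a minimal sketch of the max_concurrent_trials mechanism (`my_trainable`, `N`, and `M` are placeholders for your own setup):

```python
from ray import tune

tuner = tune.Tuner(
    my_trainable,  # placeholder trainable
    tune_config=tune.TuneConfig(
        num_samples=N,            # total trials in the pool (your N)
        max_concurrent_trials=M,  # at most M run simultaneously
    ),
)
results = tuner.fit()
```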
My understanding is that Ray Tune does not natively support a “trial pool” where you can have N trials in PENDING/PAUSED and dynamically select any M to run at a time, pausing/resuming at will, without either (a) limiting via max_concurrent_trials or (b) managing resources at the Ray cluster level. You might need to implement custom logic (e.g., using a custom scheduler or external orchestration) to achieve this behavior. Would you like more detail on possible workarounds or custom scheduler approaches?
@RunLLM First give me a possible workaround, then give me a second solution with a custom scheduler approach that lets me define the number of trials that are added.
A possible workaround is to wrap your search algorithm in a ConcurrencyLimiter with max_concurrent=M. This caps how many trials the searcher hands out at once, so no more than M of your N trials are RUNNING while the rest stay PENDING/PAUSED. Pausing and resuming individual trials is still handled by the scheduler, but the ConcurrencyLimiter guarantees the concurrency cap.
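A minimal sketch, assuming an Optuna searcher (any Searcher works; `my_trainable`, `N`, `M`, and the search space are placeholders):

```python
from ray import tune
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.optuna import OptunaSearch  # requires `optuna` installed

# Wrap the searcher so it never suggests more than M concurrent trials.
searcher = ConcurrencyLimiter(OptunaSearch(), max_concurrent=M)

tuner = tune.Tuner(
    my_trainable,  # placeholder trainable
    tune_config=tune.TuneConfig(
        search_alg=searcher,
        num_samples=N,  # N trials total, at most M suggested/running at once
    ),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder space
)
results = tuner.fit()
```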
For a custom scheduler approach, you can subclass TrialScheduler and override the choose_trial_to_run method. In your implementation, you can maintain your own logic to select which trials to run (e.g., always pick from PAUSED trials first, or implement your own round-robin or priority logic). This gives you full control over which and how many trials are started, paused, or resumed at any time, independent of the default TuneConfig or resource settings.
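A sketch of that idea, assuming the TrialScheduler interface of recent Ray versions (the exact controller argument and import paths may differ across releases, so treat this as a starting point rather than a drop-in implementation):

```python
from ray.tune.experiment import Trial
from ray.tune.schedulers import FIFOScheduler


class PoolScheduler(FIFOScheduler):
    """Keep at most `max_running` trials RUNNING, preferring to resume
    PAUSED trials over starting fresh PENDING ones."""

    def __init__(self, max_running: int):
        super().__init__()
        self.max_running = max_running

    def choose_trial_to_run(self, tune_controller):
        trials = tune_controller.get_trials()
        if sum(t.status == Trial.RUNNING for t in trials) >= self.max_running:
            return None  # M trials already running; start nothing new.
        # Resume PAUSED trials first, then pick fresh PENDING ones.
        # (Resource-availability checks are omitted in this sketch.)
        for status in (Trial.PAUSED, Trial.PENDING):
            for trial in trials:
                if trial.status == status:
                    return trial
        return None
```

You would pass this via tune.TuneConfig(scheduler=PoolScheduler(max_running=M)); your perturb-every-s-steps logic could then live in on_trial_result, returning TrialScheduler.PAUSE to put a trial back into the pool.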
Let me know if you would like me to adapt either sketch to your PBT-style setup.