Required resources should be shared between train and eval workers

If I set "num_workers": 10 and "evaluation_num_workers": 10, then 20 CPUs are required to start training, but the 10 rollout and 10 evaluation workers are used in sequence, i.e. the average utilisation on a 20-CPU machine is only going to be 50%.

Is there a way around this? I set "num_cpus_per_worker": 0.5 with 20 train and eval workers, which runs, but this is not ideal when there are other trials scheduled.
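For context, a minimal sketch of the setup I'm describing (PPO and CartPole are just placeholders here, the actual env/algorithm don't matter):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",
        "num_workers": 10,              # rollout workers used by train()
        "evaluation_num_workers": 10,   # workers used only by evaluate()
        "evaluation_interval": 1,       # evaluate after every training iteration
        "num_cpus_per_worker": 0.5,     # the fractional-CPU workaround mentioned above
    },
)
```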

@sven1977 could you take a look at this?

Hey @vakker00, great question. Just confirmed this myself: Trainer.train() and Trainer.evaluate() execute in sequence, never in parallel, even though you can - of course - parallelize the needed evaluation rollouts by specifying evaluation_num_workers > 1.

So yes, we will block all these CPUs, even though we don’t need them all at the same time.

I wonder whether it would make sense to run evaluation in a separate thread (after having synched all weights to avoid race conditions). I’ll play around with this idea and get back to you. …

The other option would be to leave it sequential and tell Tune not to reserve CPUs for evaluation (or to reserve only the max of num_workers and evaluation_num_workers).

Got it to run in parallel now (optional via a new config flag). I’ll PR. If not using parallelism, I’ll make the Trainer return reduced CPU resource requirements (b/c we never use evaluation workers and rollout workers at the same time).

PR: [RLlib] Support parallelizing evaluation and training (optional). by sven1977 · Pull Request #15040 · ray-project/ray · GitHub
The above PR will allow you to run evaluation and training in parallel.
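Usage would look roughly like this (the flag name evaluation_parallel_to_training is taken from the PR and may still change before merge, so double-check against the final code):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",
        "num_workers": 10,
        "evaluation_num_workers": 10,
        "evaluation_interval": 1,
        # New option: run evaluate() alongside train() instead of after it.
        # Name taken from the PR; confirm against the merged config.
        "evaluation_parallel_to_training": True,
    },
)
```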

Btw, we won’t be able to CPU-share between eval + train workers due to the restriction of having one actor per CPU; Tune would hang otherwise, so the option I described above (for non-parallelized eval + training) won’t work. In other words, you will still need 20 CPUs for your setup.

Thanks @sven1977 for looking into this.

So just to confirm: using "num_cpus_per_worker": 0.5 won’t work because of the one-actor-per-CPU restriction? I did manage to run a training with 10 workers and 0.5 CPU/worker, and the requested resources are 21.5/24 CPUs (I have 24 CPUs, not 20), which is (if I understand it correctly) 0.5*2*10 + 0.5 + 1, i.e. two times 10 for the rollout and eval workers weighted at 0.5, 0.5 for the model and 1 for the main process? If I remove "num_cpus_per_worker": 0.5 then it doesn’t run.

So with the new PR, evaluation could run with (let’s say) 2 CPUs while the training sampling runs with 18 (requiring 20 CPUs in total), right? Would this cause issues in case the evaluation doesn’t finish before it’s time for a new weight sync?