CPU allocation confusion

Hi, I'm training with Tune (Ray 2.3.0) on a single machine that has 16 CPUs and 1 GPU. I'm just getting a feel for how resource management works, so I am explicitly setting:
num_gpus = 0
num_gpus_per_worker = 0
num_cpus_for_local_worker = 1
num_cpus_per_worker = 1
num_rollout_workers = 1
rollout_fragment_length = 200
train_batch_size = 200 #must be = rollout_fragment_length * num_rollout_workers * num_envs_per_worker
sgd_minibatch_size = 32

Then I start the Tune job with PPO and TuneConfig(num_samples = 24) and the PBT scheduler. What I see is that Ray aggressively spins up 8 workers immediately. Why isn't it limited to 1, as specified? If it is going to ignore my limit request, why wouldn't it try to use all 16 CPUs (or 15 workers + driver)?
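Roughly, the launch looks like the sketch below (reconstructed from memory; the environment and the PBT mutations are just placeholders):

```python
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.schedulers import PopulationBasedTraining

config = (
    PPOConfig()
    .environment(env="CartPole-v1")  # placeholder env
    .resources(num_gpus=0, num_gpus_per_worker=0,
               num_cpus_for_local_worker=1, num_cpus_per_worker=1)
    .rollouts(num_rollout_workers=1, rollout_fragment_length=200)
    .training(train_batch_size=200, sgd_minibatch_size=32)
)

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=5,
    hyperparam_mutations={"lr": [1e-3, 1e-4, 1e-5]},  # placeholder mutations
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(num_samples=24, scheduler=pbt),
)
tuner.fit()
```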

The confusion continues: when I change num_cpus_per_worker to 0, it runs a whopping 12 workers! Still not using all the resources, but there is no obvious explanation for the choice (I'm guessing that 0 tells Ray to do whatever it thinks is best).

And more confusion: when I set num_cpus_per_worker = 1 again, but with
num_rollout_workers = 13
train_batch_size = 2600
it only fires up a single worker. I am now totally baffled. What is the rubric behind this worker/CPU allocation?

Thanks!

Hi @starkj,

Your current config with num_cpus_per_worker=1 and num_rollout_workers=1 will allocate 2 CPUs per RLlib trial: one actor with 1 CPU for training, and another actor with 1 CPU for doing env rollouts. With 16 CPUs available, Tune can therefore run 8 trials concurrently, allocating 8 * 2 = 16 CPUs worth of remote actors.
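Spelled out as plain arithmetic (this is just the bookkeeping Tune does, not an actual API call):

```python
# CPUs requested per trial = "local worker" (learner) CPUs + rollout worker CPUs
num_cpus_for_local_worker = 1
num_cpus_per_worker = 1
num_rollout_workers = 1

cpus_per_trial = num_cpus_for_local_worker + num_cpus_per_worker * num_rollout_workers  # = 2

total_cpus = 16
concurrent_trials = total_cpus // cpus_per_trial  # 16 // 2 = 8
```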

num_rollout_workers is an RLlib config option that defines how many rollout workers should be spawned per trial; it does not limit how many trials Tune runs concurrently. This section of the RLlib docs may help clarify these things: Getting Started with RLlib — Ray 2.3.0

To limit the concurrency on the Tune side, you can set Tuner(tune_config=tune.TuneConfig(max_concurrent_trials=1)). See ray.tune.TuneConfig — Ray 2.3.0 for more info.
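For example, a minimal sketch (reusing the config and pbt objects from your setup):

```python
from ray import tune

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(
        num_samples=24,
        scheduler=pbt,
        max_concurrent_trials=1,  # run at most one trial at a time
    ),
)
tuner.fit()
```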

Let me know if that answers your questions!

Hi @justinvyu , thanks for the reply.

It sounds like every trial is associated with 1 rollout worker (using num_cpus_per_worker) plus an additional CPU, and then Ray automatically determines how many trials can run in parallel based on the available CPUs. So if I have num_cpus_per_worker = 2, then each trial will use 3 CPUs? And if I set num_cpus_per_worker = 0, does that force the trial to do everything on a single CPU?

I was looking at it the other way around: a specific number of workers is requested (num_rollout_workers), and as each one becomes available it is assigned to work on a new trial. After seeing your answer and re-reading the manual page, I still feel that the manual is delivering that perspective. A little more description and examples there would be most helpful!

A new confusion: I just tried setting num_gpus = 1 (it was 0) and left everything else the same as at the top of my post. Now the number of concurrent trials drops from 8 to 1. As I understand it, the num_gpus param refers to what is available to the driver process (which runs the learning algo), so I would think it doesn't affect the rollout workers at all. Why did this happen?

Thanks again!

The resource allocation for actors is mainly for bookkeeping purposes (for Ray to figure out how many things to schedule concurrently). Ray does not automatically provide resource isolation, so you should limit the number of CPUs used by your own application logic (e.g., by setting the number of jobs in sklearn).
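For example (purely an illustration, not specific to RLlib): if a trial trains an sklearn model inside an actor that reserved 1 CPU, you still have to cap sklearn yourself:

```python
from sklearn.ensemble import RandomForestClassifier

# Ray reserved 1 CPU for this actor, but that reservation is only
# bookkeeping: sklearn would still use every core unless you cap it.
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
```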

For the GPU question: your cluster only has 1 GPU, so you can only run one trial at a time if each trial requests X CPUs and 1 full GPU. You can set fractional GPUs to get around this (in which case you should also make sure GPU memory usage is limited per actor).
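For example, a sketch of requesting a fractional GPU per trial (the 0.25 is just an illustration; pick whatever fits in GPU memory):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Each trial asks for a quarter of the GPU, so up to 4 trials can share
# the single physical GPU (their combined GPU memory must still fit).
config = (
    PPOConfig()
    .resources(num_gpus=0.25, num_cpus_per_worker=1, num_cpus_for_local_worker=1)
    .rollouts(num_rollout_workers=1)
)
```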

See GPU Support — Ray 2.3.0