ScalingConfig() num_workers not corresponding to training runs?

I am running a ray.tune.Tuner() that depends on a ScalingConfig(). My understanding from the documentation is that, when using only CPUs, the number of simultaneous training runs should be equal to num_workers.

However, I am observing the following: with num_workers=1 and 1 CPU per worker, 7 training runs launch simultaneously, and 8/8 of my CPUs are used (presumably 7 CPUs for the training runs and 1 CPU for ray.tune.Tuner()). With any other combination, only 1 training run launches, e.g. with num_workers=7 and 1 CPU per worker, or with num_workers=10 and 0.5 CPUs per worker.

I do not think the problem is a lack of CPUs: e.g. 10 workers * 0.5 CPUs per worker = 5 CPUs, which is less than the 7 I have available. I also get no warning about insufficient resources.

I am doing all of this with a fresh install of Ray, so I believe my version is up to date.

My code:

import ray
from ray import tune
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# train_func, num_samples, storage_path, and config are defined elsewhere in my script.

# One distributed training worker per trial, CPU only.
scaling_config = ScalingConfig(num_workers=1,
                               use_gpu=False,
                               resources_per_worker={"CPU": 1})

trainer = TorchTrainer(train_func,
                       scaling_config=scaling_config)

tuner = ray.tune.Tuner(
    trainer,
    tune_config=tune.TuneConfig(
        num_samples=num_samples
    ),
    run_config=RunConfig(
        storage_path=storage_path,
        checkpoint_config=CheckpointConfig(
            checkpoint_score_attribute="val_loss",
            checkpoint_score_order="min",
            num_to_keep=1
        )
    ),
    param_space=config
)

results = tuner.fit()

ScalingConfig.num_workers corresponds to the number of distributed workers for a single TorchTrainer. See here for an overview.

For Tune, the number of simultaneous training runs (aka trials) can be configured by TuneConfig.max_concurrent_trials.
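
For example, something like this, a minimal sketch reusing trainer, num_samples, storage_path, and config from your snippet (the value 3 is just an illustration):

tuner = ray.tune.Tuner(
    trainer,
    tune_config=tune.TuneConfig(
        num_samples=num_samples,
        max_concurrent_trials=3,  # run at most 3 trials at the same time
    ),
    run_config=RunConfig(storage_path=storage_path),
    param_space=config,
)
results = tuner.fit()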

I have increased max_concurrent_trials, but I do not see any increase in the number of simultaneously running trials.

I have 7 available CPUs. I set max_concurrent_trials=n in TuneConfig(). I want to run n concurrent trials. How can I do this? It is not clear to me from the documentation.

@ray_user_11 The Ray Train “coordinator” actor actually takes up 1 CPU by default as well. So even if you have 1 worker, you’ll have 2 actors:

Coordinator (1 CPU [default])
  -> Worker (1 CPU [default, and from your setting])

The coordinator is not really doing any work (just lightweight event processing and making remote function calls), so you can set that to 0 CPUs.

scaling_config = ScalingConfig(num_workers=1, 
                               use_gpu=False,
+                              trainer_resources={"CPU": 0},
                               resources_per_worker={"CPU": 1})

Now each “TorchTrainer” trial will take up just a single CPU, so with your 7 available CPUs you should be able to get 7 concurrent trials running at a time.
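
As a sanity check, you can also ask Ray what it thinks it has to schedule against; a quick sketch:

import ray

ray.init(ignore_reinit_error=True)  # safe to call even if Ray is already initialized

print(ray.cluster_resources())    # total logical resources, e.g. {"CPU": 8.0, ...}
print(ray.available_resources())  # what is currently free for new trials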

Is the number of possible concurrent trials equal to the number of CPUs? Or can I use fractional resources_per_worker, setting num_workers=1, trainer_resources={"CPU": 0}, and resources_per_worker={"CPU": number_of_cpus / possible_concurrent_trials}, to get whatever number of concurrent trials I want? How do I calculate the ceiling on possible concurrent trials given my hardware?

Setting fractional resources_per_worker has not led to any increase in the number of concurrent trials for me in the past. Instead, it has decreased the number of concurrent trials.
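
For illustration only, a rough sketch of the resource math, assuming each trial reserves trainer_resources plus num_workers * resources_per_worker; the actual concurrency can still be capped lower by max_concurrent_trials or the search algorithm:

# Back-of-the-envelope ceiling on concurrent trials (illustrative numbers).
total_cpus = 7        # CPUs Ray reports for the machine
trainer_cpus = 0      # trainer_resources={"CPU": 0}
num_workers = 1       # ScalingConfig(num_workers=1)
cpus_per_worker = 1   # resources_per_worker={"CPU": 1}

cpus_per_trial = trainer_cpus + num_workers * cpus_per_worker
ceiling = int(total_cpus // cpus_per_trial)
print(ceiling)  # 7 with these numbers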