I am running a ray.tune.Tuner() that depends on a ScalingConfig(). My understanding from the documentation is that, when using only CPUs, the number of simultaneous training runs should equal num_workers.
However, I am observing the following: with num_workers=1 and 1 CPU per worker, 7 training runs launch simultaneously and 8/8 of my CPUs are used (presumably 7 CPUs for the training runs and 1 CPU for ray.tune.Tuner()). With any other combination, only 1 training run launches, e.g. with num_workers=7 and 1 CPU per worker, or with num_workers=10 and 0.5 CPU per worker.
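For concreteness, the other combinations I tried differ only in the ScalingConfig; the rest of the code (shown at the end) is unchanged:

scaling_config = ScalingConfig(num_workers=7,
                               use_gpu=False,
                               resources_per_worker={"CPU": 1})
# or
scaling_config = ScalingConfig(num_workers=10,
                               use_gpu=False,
                               resources_per_worker={"CPU": 0.5})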
I do not think the problem is a lack of CPUs: for example, 10 workers * 0.5 CPUs per worker = 5 CPUs, which is less than the 7 I have available for workers. I also get no message about insufficient resources.
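For reference, a quick way to confirm what Ray has registered is ray.cluster_resources() (cluster totals) and ray.available_resources() (what is currently free):

import ray

ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())    # totals Ray sees, e.g. {"CPU": 8.0, ...}
print(ray.available_resources())  # whatever is not currently reserved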
I am doing all of this with a fresh install of Ray, so I believe my version is up to date.
My code:
import ray
from ray import tune
from ray.train import ScalingConfig, RunConfig, CheckpointConfig
from ray.train.torch import TorchTrainer

# train_func, num_samples, storage_path, and config are defined elsewhere.

# 1 worker per trial, 1 CPU per worker, no GPU
scaling_config = ScalingConfig(num_workers=1,
                               use_gpu=False,
                               resources_per_worker={"CPU": 1})

trainer = TorchTrainer(train_func,
                       scaling_config=scaling_config)

tuner = ray.tune.Tuner(
    trainer,
    tune_config=tune.TuneConfig(
        num_samples=num_samples
    ),
    run_config=RunConfig(
        storage_path=storage_path,
        checkpoint_config=CheckpointConfig(
            checkpoint_score_attribute="val_loss",
            checkpoint_score_order="min",
            num_to_keep=1
        )
    ),
    param_space=config
)

results = tuner.fit()