ScalingConfig() num_workers not corresponding to training runs?

I am running a ray.tune.Tuner() that depends on a ScalingConfig(). My understanding from the documentation is that, when using only CPUs, the number of simultaneous training runs should be equal to num_workers.

However, I am observing the following: with num_workers=1 and 1 CPU per worker, 7 training runs launch simultaneously and 8/8 of my CPUs are used (presumably 7 CPUs for the training runs and 1 CPU for ray.tune.Tuner()). With any other combination, only 1 training run launches, e.g. with num_workers=7 and 1 CPU per worker, or with num_workers=10 and 0.5 CPU per worker.

I do not think the problem is due to lack of CPUs, as e.g. 10 workers * 0.5 CPUs per worker = 5 CPUs, which is less than the 7 I have available. I also get no message about lacking resources.

I am doing all of this with a fresh install of Ray, so I believe my version is up to date.

My code:

import ray
from ray import tune
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# train_func, num_samples, storage_path, and config are defined elsewhere.

scaling_config = ScalingConfig(num_workers=1,
                               use_gpu=False,
                               resources_per_worker={"CPU": 1})

trainer = TorchTrainer(train_func,
                       scaling_config=scaling_config)

tuner = ray.tune.Tuner(
    trainer,
    tune_config=tune.TuneConfig(
        num_samples=num_samples
    ),
    run_config=RunConfig(
        storage_path=storage_path,
        checkpoint_config=CheckpointConfig(
            checkpoint_score_attribute="val_loss",
            checkpoint_score_order="min",
            num_to_keep=1
        )
    ),
    param_space=config
)

results = tuner.fit()

ScalingConfig.num_workers corresponds to the number of distributed workers for a single TorchTrainer. See here for an overview.

For Tune, the number of simultaneous training runs (aka trials) can be configured by TuneConfig.max_concurrent_trials.
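
For reference, a minimal sketch of where that setting goes (the cap of 4 is just an illustrative value, not from this thread):

from ray import tune

tune_config = tune.TuneConfig(
    num_samples=num_samples,   # total number of trials to run
    max_concurrent_trials=4,   # upper bound on trials running at the same time
)

Note that this is only an upper bound; actual concurrency is still limited by the resources each trial requests.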

I have increased max_concurrent_trials, but I do not see any increase in the number of simultaneously running trials.

I have 7 available CPUs, and I set max_concurrent_trials=n in TuneConfig() because I want to run n concurrent trials. How can I do this? It is not clear to me from the documentation.

@ray_user_11 The Ray Train “coordinator” actor actually takes up 1 CPU by default as well. So even if you have 1 worker, you’ll have 2 actors:

Coordinator (1 CPU [default])
  -> Worker (1 CPU [default, and from your setting])

The coordinator is not really doing any work (just lightweight event processing and making remote function calls), so you can set that to 0 CPUs.

scaling_config = ScalingConfig(num_workers=1,
                               use_gpu=False,
                               trainer_resources={"CPU": 0},  # the added line
                               resources_per_worker={"CPU": 1})

Now each TorchTrainer instance takes up just a single CPU in total, so with your 7 available CPUs you should be able to get 7 concurrent trials running at a time.
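
As a quick sanity check (a sketch, not part of the original reply), you can ask Ray what it thinks the cluster has before and during tuning:

import ray

ray.init()                        # or ray.init(address="auto") to attach to an existing cluster
print(ray.cluster_resources())    # total resources Ray sees, e.g. {'CPU': 8.0, ...}
print(ray.available_resources())  # what is currently free for new trials

If the reported CPU count is lower than expected, the concurrency math above will be off by the same amount.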

Is the possible number of concurrent trials simply equal to the number of CPUs? Or can I use a fractional resources_per_worker, setting num_workers=1, trainer_resources={"CPU": 0}, resources_per_worker={"CPU": number_of_cpus / possible_concurrent_trials}, to get whatever number of concurrent trials I want? How do I calculate the ceiling for possible concurrent trials given my hardware?

Setting fractional resources_per_worker has not led to any increase in the number of concurrent trials for me in the past. Instead, it has decreased the number of concurrent trials.

I am encountering the same confusion. It is not clear to me how to configure the number of concurrent trials properly via ScalingConfig when running with Ray Train.

@ray_user_11 @EthanMarx

Ray will not stop you from scheduling as many concurrent trials as you want.

For example, let’s say I have a cluster with 16 CPUs.

This setup would give 16 CPUs / (2 CPUs per trial) = 8 concurrent trials:

num_workers = 2
cpus_per_worker = 1

trainer = TorchTrainer(
    ...,
    scaling_config=ScalingConfig(
        num_workers=2,
        trainer_resources={"CPU": 0},
        resources_per_worker={"CPU": cpus_per_worker},
    )
)
tune.Tuner(trainer, tune_config=tune.TuneConfig(num_samples=100), ...)

This setup would give 16 CPUs / (0.5 CPUs per trial) = 32 concurrent trials:

num_workers = 2
cpus_per_worker = 0.25  # 2 workers x 0.25 CPU = 0.5 CPU per trial

trainer = TorchTrainer(
    ...,
    scaling_config=ScalingConfig(
        num_workers=2,
        trainer_resources={"CPU": 0},
        resources_per_worker={"CPU": cpus_per_worker},
    )
)
tune.Tuner(trainer, tune_config=tune.TuneConfig(num_samples=100), ...)

In practice, you’ll usually be scheduling based on GPU availability, which is the limiting factor for your max concurrency (each trial needs X whole GPUs).
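
The arithmetic above generalizes: each trial reserves trainer_resources plus num_workers times resources_per_worker, and the concurrency ceiling is the cluster total divided by that per-trial footprint. Here is a sketch of that calculation (concurrency_ceiling is a hypothetical helper for illustration, not a Ray API):

import math

def concurrency_ceiling(total_cpus, total_gpus,
                        num_workers, cpus_per_worker,
                        gpus_per_worker=0, trainer_cpus=0):
    # Per-trial footprint: the coordinator plus all of its workers.
    cpus_per_trial = trainer_cpus + num_workers * cpus_per_worker
    gpus_per_trial = num_workers * gpus_per_worker
    limits = []
    if cpus_per_trial > 0:
        limits.append(total_cpus / cpus_per_trial)
    if gpus_per_trial > 0:
        limits.append(total_gpus / gpus_per_trial)
    return math.floor(min(limits)) if limits else 0

print(concurrency_ceiling(16, 0, num_workers=2, cpus_per_worker=1))     # 8
print(concurrency_ceiling(16, 0, num_workers=2, cpus_per_worker=0.25))  # 32

Whether Tune actually reaches that ceiling also depends on num_samples and max_concurrent_trials.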

What is the motivation for scaling up the number of workers in the ScalingConfig as opposed to adding more resources_per_worker?

How do these workers interact with the pytorch lightning workers that get launched during distributed training?

For example, what’s the difference between specifying


trainer = TorchTrainer(
    ...,
    scaling_config=ScalingConfig(
        num_workers=1,
        trainer_resources={"CPU": 0},
        resources_per_worker={"GPU": 4},
    )
)

and

trainer = TorchTrainer(
    ...,
    scaling_config=ScalingConfig(
        num_workers=4,
        trainer_resources={"CPU": 0},
        resources_per_worker={"GPU": 1},
    )
)

Am I correct in thinking that, in both of these cases, a cluster with 4 GPUs will launch only one trial?

Also, for my use case I am launching Ray workers via Kubernetes, so it’s possible that the GPUs are spread across multiple nodes.


@justinvyu

So I’ve configured a static Ray Kubernetes cluster with 1 node that contains 8 GPUs and 68 CPUs.
I launched a Tune job requesting 20 trials with the following ScalingConfig:


scaling_config = ScalingConfig(
    trainer_resources={"CPU": 0},
    resources_per_worker={"CPU": 8, "GPU": 1},
    num_workers=1,
    use_gpu=True,
)

The Tune job starts fine and initially schedules 8 trials. However, some of the trials eventually error out and I see the following:

Ignore this message if the cluster is autoscaling. No trial is running and no new trial has been started within the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 8.0 CPUs and 1.0 GPUs per trial, but the cluster only has 4.0 CPUs and 0 GPUs available. Stop the tuning and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.

I am confused why the tuner is trying to schedule another trial when all of my resources are already taken. How can I fix this? Thank you!