How to correctly allocate resources with Tune + TorchTrainer on Slurm

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,

I am using Ray 2.2.0 on a Slurm cluster but have several problems with correct resource allocation. In this example, I reserve 4 nodes, each with 10 CPUs and 2 GPUs, with

#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --gpus-per-task=2

and I want to use the 4 nodes in parallel to tune my model. Each training should use 8 CPUs and 2 GPUs with DistributedDataParallel. Therefore, in the main function I have:

import ray
from ray import tune
from ray.air.config import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        resources_per_worker={"CPU": 8, "GPU": 2},
        use_gpu=True,
        num_workers=1,
        _max_cpu_fraction_per_node=0.8,
    ),
)
tuner = tune.Tuner(
    trainer,
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        scheduler=scheduler,
        num_samples=num_samples,
    ),
    param_space=config,
    run_config=RunConfig(
        local_dir="./results",
        name="test_cifar",
        sync_config=ray.tune.SyncConfig(sync_on_checkpoint=False),
    ),
)
results = tuner.fit()

and the training function is:

from ray import train
from ray.air import session


def train_loop_per_worker(config):
    print("Local rank:", session.get_local_rank())
    print("World size:", session.get_world_size())
    # net, trainloader and valloader are built in the elided parts
    net = train.torch.prepare_model(net)  # wraps the model in DDP and moves it to the right device
    ...
    trainloader = train.torch.prepare_data_loader(trainloader)
    valloader = train.torch.prepare_data_loader(valloader)
    ...
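
For context, the elided parts build the model and the data loaders and run a standard training/validation loop; roughly, each epoch ends by reporting the metrics that Tune is configured on. A minimal sketch of that reporting step (the helper name and the val_loss/val_acc values are placeholders, not my actual code):

from ray.air import session

def report_epoch_metrics(val_loss: float, val_acc: float) -> None:
    # Sketch only: called at the end of each epoch inside train_loop_per_worker,
    # so that Tune sees the "loss" metric used by TuneConfig(metric="loss", mode="min").
    session.report({"loss": val_loss, "accuracy": val_acc})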

As you can see in the status report below, only two trials run in parallel, and each of them apparently uses 9 CPUs and 2 GPUs (see the small bookkeeping sketch after the table). However, the models seem to be loaded and trained on just a single GPU, since session.get_world_size() is always 1 (see the log further below).

== Status ==
Current time: 2022-12-16 16:13:51 (running for 00:03:41.15)
Memory usage on this node: 28.4/187.2 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: -1.2921145160436631 | Iter 4.000: -1.3493594537615776 | Iter 2.000: -1.4290564769506455 | Iter 1.000: -1.5200191211223602
Resources requested: 18.0/40 CPUs, 4.0/8 GPUs, 0.0/433.98 GiB heap, 0.0/189.98 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: bb708_00001 with loss=1.2228425681114197 and parameters={'train_loop_config': {'l1': 32, 'l2': 8, 'lr': 0.0028581659786108215, 'batch_size': 8}}
Result logdir: /gpfsdswork/projects/rech/mwh/ujn44cd/rayTune/results/test_cifar
Number of trials: 10/10 (3 PENDING, 2 RUNNING, 5 TERMINATED)
+--------------------------+------------+---------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+---------$
| Trial name               | status     | loc                 |   train_loop_config/ba |   train_loop_config/l1 |   train_loop_config/l2 |   train_loop_config/lr |   iter |   total time (s) |    loss |   accura$
|                          |            |                     |               tch_size |                        |                        |                        |        |                  |         |	  $
|--------------------------+------------+---------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+---------$
| TorchTrainer_bb708_00005 | RUNNING    | 10.148.7.112:132639 |                      4 |                    128 |                     16 |            0.0180833   |        |                  |         |	  $
| TorchTrainer_bb708_00006 | RUNNING    | 10.148.8.98:4021279 |                      2 |                     16 |                    256 |            0.088919    |        |                  |         |	  $
| TorchTrainer_bb708_00007 | PENDING    |                     |                      8 |                      8 |                     32 |            0.000215026 |        |                  |         |	  $
| TorchTrainer_bb708_00008 | PENDING    |                     |                      4 |                      4 |                    128 |            0.0144106   |        |                  |         |	  $
| TorchTrainer_bb708_00009 | PENDING    |                     |                      4 |                      4 |                    128 |            0.00260305  |        |                  |         |	  $
| TorchTrainer_bb708_00000 | TERMINATED | 10.148.8.97:177132  |                      4 |                      4 |                     64 |            0.0918294   |	 1 |          42.4549 | 2.33702 |     0.09$
| TorchTrainer_bb708_00001 | TERMINATED | 10.148.8.99:605598  |                      8 |                     32 |                      8 |            0.00285817  |     10 |         181.016  | 1.22284 |     0.58$
| TorchTrainer_bb708_00002 | TERMINATED | 10.148.7.112:130711 |                      8 |                     32 |                    256 |            0.00518242  |	 2 |          43.8126 | 1.42906 |     0.50$
| TorchTrainer_bb708_00003 | TERMINATED | 10.148.8.97:178801  |                      8 |                     32 |                      8 |            0.00472392  |	 2 |          43.7392 | 1.46415 |     0.45$
| TorchTrainer_bb708_00004 | TERMINATED | 10.148.7.112:132584 |                     16 |                      8 |                     32 |            0.0032123   |	 1 |          18.5921 | 1.65015 |     0.38$
+--------------------------+------------+---------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+---------$
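
To make the numbers explicit, here is my reading of the "Resources requested" line above. This is just bookkeeping, assuming the default trainer_resources of 1 CPU per TorchTrainer (mentioned further below), not something verified against the scheduler internals:

# Bookkeeping for the status line above. Assumption: each TorchTrainer trial
# reserves 1 CPU for the trainer actor itself (the default trainer_resources)
# on top of its workers' resources.
trainer_cpus = 1                              # default trainer_resources
worker_cpus = 1 * 8                           # num_workers * resources_per_worker["CPU"]
worker_gpus = 1 * 2                           # num_workers * resources_per_worker["GPU"]

per_trial_cpus = trainer_cpus + worker_cpus   # 9 CPUs per trial
per_trial_gpus = worker_gpus                  # 2 GPUs per trial

running_trials = 2
print(per_trial_cpus * running_trials, per_trial_gpus * running_trials)
# -> 18 4, matching "Resources requested: 18.0/40 CPUs, 4.0/8 GPUs"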



A part of the log is:

(RayTrainWorker pid=129533, ip=10.159.48.31) 2022-12-16 16:10:20,853 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=129533, ip=10.159.48.31) 2022-12-16 16:10:21,419 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=4014276, ip=10.159.48.65) 2022-12-16 16:10:26,291 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=4014276, ip=10.159.48.65) 2022-12-16 16:10:26,854 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=177233, ip=10.159.48.64) 2022-12-16 16:11:13,360 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=177233, ip=10.159.48.64) 2022-12-16 16:11:13,959 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=130787, ip=10.159.48.31) 2022-12-16 16:12:03,403 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=130787, ip=10.159.48.31) 2022-12-16 16:12:03,947 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=178936, ip=10.159.48.64) 2022-12-16 16:12:55,372 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=178936, ip=10.159.48.64) 2022-12-16 16:12:55,889 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=179804, ip=10.159.48.64) 2022-12-16 16:13:20,351 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=179804, ip=10.159.48.64) 2022-12-16 16:13:20,878 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=605738, ip=10.159.48.66) 2022-12-16 16:13:35,419 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=605738, ip=10.159.48.66) 2022-12-16 16:13:35,977 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=180681, ip=10.159.48.64) 2022-12-16 16:14:08,362 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=180681, ip=10.159.48.64) 2022-12-16 16:14:08,872 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=132822, ip=10.159.48.31) 2022-12-16 16:14:40,552 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=132822, ip=10.159.48.31) 2022-12-16 16:14:41,096 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=4021417, ip=10.159.48.65) 2022-12-16 16:14:58,410 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=4021417, ip=10.159.48.65) 2022-12-16 16:14:58,995 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0

What is wrong in my code? Am I assigning the resources to the TorchTrainer incorrectly? Thank you for the help!

UPDATE:

The problem is solved by setting num_workers=2 and changing the assigned resources as follows:

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        resources_per_worker={"CPU": 1, "GPU": 1},
        use_gpu=True,
        num_workers=2,
    ),
)

Now all 4 nodes are used in parallel for the tuning, and on each of them the 2 GPUs are used to train the model with DDP (each trial now requests roughly 1 CPU for the trainer plus 2 × 1 CPU and 2 × 1 GPU for the workers, so one trial fits on each node).

In any case, the difference between resources_per_worker and trainer_resources (which in this case is 1 CPU by default) is still not clear to me from the documentation.
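
To illustrate what I mean, my current understanding is that the two options could be set explicitly like this (a sketch based on my reading of the Ray 2.2 docs, not something I have verified):

from ray.air.config import ScalingConfig

# trainer_resources: resources reserved for the TorchTrainer actor itself
#   (the coordinator that spawns the workers), 1 CPU by default.
# resources_per_worker: resources reserved for each of the num_workers
#   RayTrainWorker processes that actually run train_loop_per_worker.
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    trainer_resources={"CPU": 1},               # the default
    resources_per_worker={"CPU": 1, "GPU": 1},  # per DDP worker
)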

Let’s say that I want to use n workers for the train DataLoader by setting num_workers=n in its definition inside the train_loop_per_worker function. Do I need to take this into account when assigning the resources of the TorchTrainer? If yes, in resources_per_worker or trainer_resources?

Hey @Matteo_Bastico, glad you were able to get this working!

Let’s say that I want to use n workers for the train DataLoader by setting num_workers=n in its definition inside the train_loop_per_worker function. Do I need to take this into account when assigning the resources of the TorchTrainer? If yes, in resources_per_worker or trainer_resources?

In this case you would want to take this into account in resources_per_worker. This is because the DataLoader is instantiated and executed inside train_loop_per_worker, which runs on the worker side, not on the trainer side.
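
For example (a sketch, with n=4 picked purely for illustration), you could budget the CPUs for the DataLoader subprocesses on the Train worker side like this:

from ray.air.config import ScalingConfig

NUM_DATALOADER_WORKERS = 4  # the "n" you pass to DataLoader(num_workers=n)

# Each RayTrainWorker runs train_loop_per_worker, which creates the DataLoader,
# so the DataLoader's worker processes are budgeted per Train worker:
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"CPU": NUM_DATALOADER_WORKERS, "GPU": 1},
)

Per trial this asks for 1 (trainer) + 2 × 4 = 9 CPUs and 2 GPUs, which should still fit in your 4-node, 10-CPU-per-node allocation.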