What is an example where setting ScalingConfig's trainer_resources argument is useful?

So I have 1 node with 3 CPUs and 3 GPUs, and I have these two cases.

Arguments used in both cases: num_workers=1, use_gpu=True
Case A:

```python
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={'CPU': 1, 'GPU': 1},
        trainer_resources={'CPU': 0},
    ),
)
result = trainer.fit()
```

This requests:

### System Info
Using FIFO scheduling algorithm.
Resources requested: 1.0/3 CPUs, 1.0/3 GPUs, 0.0/16.48 GiB heap, 0.0/8.24 GiB objects (0.0/1.0 accelerator_type:G)

Case B:

```python
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={'CPU': 1, 'GPU': 1},
        trainer_resources={'CPU': 1},  # the default value, written here for clarity
    ),
)
result = trainer.fit()
```

This requests:

### System Info
Using FIFO scheduling algorithm.
Resources requested: 2.0/3 CPUs, 1.0/3 GPUs, 0.0/16.48 GiB heap, 0.0/8.24 GiB objects (0.0/1.0 accelerator_type:G)

Summary
Okay, so far I understand there are two components whose resources we can configure, the trainer and the workers that execute the code, such that:

  • Case A: total CPU = 1 (0 trainer, 1 worker)
  • Case B: total CPU = 2 (1 trainer, 1 worker)

Both cases allocate 1 GPU, which is configured per worker.
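The accounting above can be sketched as plain arithmetic (this is just my own illustration of how the totals add up, not Ray code):

```python
def total_request(num_workers, resources_per_worker, trainer_resources):
    """Sum the trainer's resources with those of all workers."""
    total = dict(trainer_resources)
    for resource, amount in resources_per_worker.items():
        total[resource] = total.get(resource, 0) + num_workers * amount
    return total

# Case A: trainer gets 0 CPUs
print(total_request(1, {'CPU': 1, 'GPU': 1}, {'CPU': 0}))  # {'CPU': 1, 'GPU': 1}

# Case B: trainer gets the default 1 CPU
print(total_request(1, {'CPU': 1, 'GPU': 1}, {'CPU': 1}))  # {'CPU': 2, 'GPU': 1}
```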

Question
In what scenario is setting trainer_resources={'CPU': 1} useful?
Or, what is the purpose of trainer_resources, and can it also contain GPUs?
Because in the cases above they seem to behave the same.
I don't see anything in the docs that discusses this further; let me know if I missed it. I'm also still quite confused about the difference between trainer_resources and resources_per_worker, so some explanation would be appreciated.

Thank you! :slight_smile:

The trainer by default occupies 1 CPU.

Usually it doesn't do much work besides metrics aggregation and reporting, checkpoint writing, etc., so 1 CPU is enough.

If your resources are really tight, you can experiment with giving the trainer 0 CPUs, for example.
Or, if checkpointing is really complicated for some model, you can try allocating more CPUs to the trainer so it doesn't become a bottleneck.
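As a sketch of that second case (reusing the TorchTrainer setup from the question; the exact CPU count of 2 is just an illustrative choice), giving the trainer extra headroom for checkpointing might look like:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Sketch: reserve 2 CPUs for the trainer process so checkpoint writing
# doesn't become a bottleneck; each worker keeps 1 CPU + 1 GPU.
# train_func and config are assumed to be defined as in the question.
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={'CPU': 1, 'GPU': 1},
        trainer_resources={'CPU': 2},  # extra headroom for checkpointing
    ),
)
result = trainer.fit()
```

On the 3 CPU / 3 GPU node from the question, this would request 3 CPUs total (2 trainer + 1 worker) and 1 GPU.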