I have 1 node with 3 CPUs and 3 GPUs, and I'm comparing the two cases below.
In both cases, num_workers=1 and use_gpu=True.
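For completeness, here is a rough sketch of the setup both cases assume (Ray 2.x import paths); `train_func` and `config` are placeholders standing in for my real training loop and hyperparameters:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # placeholder for the actual per-worker training loop
    ...

config = {}        # hyperparameters passed in via train_loop_config
num_workers = 1
use_gpu = True
```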
Case A:
```python
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"CPU": 1, "GPU": 1},
        trainer_resources={"CPU": 0},
    ),
)
result = trainer.fit()
```
This will request:

### System Info

```
Using FIFO scheduling algorithm.
Resources requested: 1.0/3 CPUs, 1.0/3 GPUs, 0.0/16.48 GiB heap, 0.0/8.24 GiB objects (0.0/1.0 accelerator_type:G)
```
Case B:
```python
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"CPU": 1, "GPU": 1},
        trainer_resources={"CPU": 1},  # the default value, written here for clarity
    ),
)
result = trainer.fit()
```
This will request:

### System Info

```
Using FIFO scheduling algorithm.
Resources requested: 2.0/3 CPUs, 1.0/3 GPUs, 0.0/16.48 GiB heap, 0.0/8.24 GiB objects (0.0/1.0 accelerator_type:G)
```
Summary
So far I understand that there are two components whose resources we can configure, the trainer and the workers, such that:
- Case A: total CPU = 1 (0 for the trainer, 1 for the worker)
- Case B: total CPU = 2 (1 for the trainer, 1 for the worker)
Both cases allocate 1 GPU, which comes from resources_per_worker (see the sketch below).
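To make the arithmetic explicit, here is a tiny plain-Python sketch of how I believe the totals are computed (`expected_cpus` is just a hypothetical helper, not a Ray API):

```python
def expected_cpus(trainer_resources, resources_per_worker, num_workers):
    # total CPUs = CPUs for the trainer actor
    #            + CPUs per training worker * number of workers
    return (trainer_resources.get("CPU", 0)
            + resources_per_worker.get("CPU", 0) * num_workers)

# Case A: 0 + 1 * 1 -> 1.0/3 CPUs requested
print(expected_cpus({"CPU": 0}, {"CPU": 1, "GPU": 1}, 1))
# Case B: 1 + 1 * 1 -> 2.0/3 CPUs requested
print(expected_cpus({"CPU": 1}, {"CPU": 1, "GPU": 1}, 1))
```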
Question
In what scenario is setting trainer_resources={'CPU': 1} actually useful?
More generally, what is the purpose of trainer_resources, and can it also include a GPU?
From the cases above, the two configurations appear to behave the same.
I don't see anything in the docs that discusses this further; let me know if I missed it. I'm also still quite confused about the difference between trainer_resources and resources_per_worker, so some explanation would be appreciated.
Thank you!