I have 1 node with 3 CPUs and 3 GPUs, and I'm comparing the two cases below.
In both cases, num_workers=1 and use_gpu=True.
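For completeness, here is a rough sketch of the setup both cases assume (Ray 2.x import paths); `train_func` and `config` are placeholders standing in for my real training loop and hyperparameters:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # placeholder for the actual per-worker training loop
    ...

config = {}        # hyperparameters passed in via train_loop_config
num_workers = 1
use_gpu = True
```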
Case A:
```python
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"CPU": 1, "GPU": 1},
        trainer_resources={"CPU": 0},
    ),
)
result = trainer.fit()
```
This will request:

### System Info

```
Using FIFO scheduling algorithm.
Resources requested: 1.0/3 CPUs, 1.0/3 GPUs, 0.0/16.48 GiB heap, 0.0/8.24 GiB objects (0.0/1.0 accelerator_type:G)
```
Case B:
```python
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"CPU": 1, "GPU": 1},
        trainer_resources={"CPU": 1},  # the default value, written here for clarity
    ),
)
result = trainer.fit()
```
This will request:

### System Info

```
Using FIFO scheduling algorithm.
Resources requested: 2.0/3 CPUs, 1.0/3 GPUs, 0.0/16.48 GiB heap, 0.0/8.24 GiB objects (0.0/1.0 accelerator_type:G)
```
Summary
So far I understand that there are two components whose resources we can configure, the trainer and the workers, such that:
- Case A: total CPU = 1 (0 for the trainer, 1 for the worker)
- Case B: total CPU = 2 (1 for the trainer, 1 for the worker)
Both cases allocate 1 GPU, which comes from resources_per_worker (see the sketch below).
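To make the arithmetic explicit, here is a tiny plain-Python sketch of how I believe the totals are computed (`expected_cpus` is just a hypothetical helper, not a Ray API):

```python
def expected_cpus(trainer_resources, resources_per_worker, num_workers):
    # total CPUs = CPUs for the trainer actor
    #            + CPUs per training worker * number of workers
    return (trainer_resources.get("CPU", 0)
            + resources_per_worker.get("CPU", 0) * num_workers)

# Case A: 0 + 1 * 1 -> 1.0/3 CPUs requested
print(expected_cpus({"CPU": 0}, {"CPU": 1, "GPU": 1}, 1))
# Case B: 1 + 1 * 1 -> 2.0/3 CPUs requested
print(expected_cpus({"CPU": 1}, {"CPU": 1, "GPU": 1}, 1))
```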
Question
In what scenario is setting trainer_resources={'CPU': 1} actually useful?
More generally, what is the purpose of trainer_resources, and can it also include a GPU?
From the cases above, the two configurations appear to behave the same.
I don't see anything in the docs that discusses this further; let me know if I missed it. I'm also still quite confused about the difference between trainer_resources and resources_per_worker, so some explanation would be appreciated.
Thank you!