When to use multiple GPUs per worker for a training job

Hello everyone,

I would like to know when I should use multiple GPUs per worker for a Ray training job, by setting the scaling_config to something like

ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":4})

So far I haven’t found any working examples of this setup, only documentation saying it’s possible.

This also raises a question: would the code be significantly different from the single-GPU-per-worker case? Is there really any benefit to using multiple GPUs per worker for a Ray training job?

For launching a distributed training job in Ray with 8 GPUs, which scaling config is recommended, and when should a multi-GPU-per-worker scaling config be used?

ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":4})
ScalingConfig(num_workers=8, use_gpu=True, resources_per_worker={"GPU":1})

Take the PyTorch Fashion-MNIST example: how should I modify it to fully utilize the 8 GPUs with a multi-GPU-per-worker scaling config like the one below? Is this good practice?

ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":4})

With a default training setup, torch will only use one GPU per process. This is more of a torch DDP question than a Ray one.
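For context, here is a minimal sketch of what each worker process runs (assuming the Ray 2.x ray.train.torch API; the tiny model and the Fashion-MNIST loading are just placeholders). prepare_model moves the model to that worker's single GPU and wraps it in DDP, and prepare_data_loader adds a DistributedSampler so each worker sees its own data shard:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

import ray.train.torch
from ray import train


def train_loop_per_worker(config):
    # Each Ray Train worker is one process driving exactly one GPU.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    model = ray.train.torch.prepare_model(model)  # move to GPU + wrap in DDP

    dataset = datasets.FashionMNIST(
        root="~/data", train=True, download=True, transform=transforms.ToTensor()
    )
    # Shard the data across workers and move batches to the worker's device.
    loader = ray.train.torch.prepare_data_loader(
        DataLoader(dataset, batch_size=config["batch_size"])
    )

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])

    for epoch in range(config["epochs"]):
        for X, y in loader:
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        train.report({"epoch": epoch, "loss": loss.item()})
```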

For 99% of use-cases, steer away from manually specifying resources_per_worker for training jobs; just set num_workers to the number of GPUs in your cluster.
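Concretely, with 8 GPUs that would look like the sketch below (assuming the Ray 2.x TorchTrainer API; the train_loop_config values are placeholders). Each of the 8 workers is a single-GPU DDP process:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# One worker per GPU: 8 single-GPU DDP processes across the cluster.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 64, "lr": 1e-3, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```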