When to use multiple GPUs per worker for a training job

Hello everyone,

I would like to know when I should use multiple GPUs per worker for a Ray training job, by setting the scaling_config to something like

ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":4})

So far I haven’t found any working examples of this setup, only documentation saying it’s possible.

This also raises a question: would the code be significantly different from the single-GPU-per-worker case? Is there really any benefit to using multiple GPUs per worker for a Ray training job?

For launching a distributed training job in Ray with 8 GPUs, which scaling config is recommended, and when should a multi-GPU-per-worker scaling config be used?

ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":4})
ScalingConfig(num_workers=8, use_gpu=True, resources_per_worker={"GPU":1})

Take the PyTorch Fashion-MNIST example: how should I modify it to fully utilize the 8 GPUs with a multi-GPU-per-worker scaling config like the one below? Is this good practice?

ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":4})

With a default training setup, torch will only use one GPU per process. This is more of a torch DDP question than a Ray one.
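For context, here is a minimal sketch of what each worker process runs (assuming the Ray 2.x ray.train.torch API; the tiny model and the Fashion-MNIST loading are just placeholders). prepare_model moves the model to that worker's single GPU and wraps it in DDP, and prepare_data_loader adds a DistributedSampler so each worker sees its own data shard:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

import ray.train.torch
from ray import train


def train_loop_per_worker(config):
    # Each Ray Train worker is one process driving exactly one GPU.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    model = ray.train.torch.prepare_model(model)  # move to GPU + wrap in DDP

    dataset = datasets.FashionMNIST(
        root="~/data", train=True, download=True, transform=transforms.ToTensor()
    )
    # Shard the data across workers and move batches to the worker's device.
    loader = ray.train.torch.prepare_data_loader(
        DataLoader(dataset, batch_size=config["batch_size"])
    )

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])

    for epoch in range(config["epochs"]):
        for X, y in loader:
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        train.report({"epoch": epoch, "loss": loss.item()})
```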

For 99% of use-cases, steer away from manually specifying resources_per_worker for training jobs; just set num_workers to the number of GPUs in your cluster.
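Concretely, with 8 GPUs that would look like the sketch below (assuming the Ray 2.x TorchTrainer API; the train_loop_config values are placeholders). Each of the 8 workers is a single-GPU DDP process:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# One worker per GPU: 8 single-GPU DDP processes across the cluster.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 64, "lr": 1e-3, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```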