Data Parallelism with Ray Tune

I am running Ray Tune with gpu=1 per trial. Each trial uses only one GPU even though I wrap the model with net = torch.nn.DataParallel(net), so the batch is not being split across all 8 GPUs. Do I need to set gpu=8 in the resource configuration to allow data parallelism across GPUs?

This is the resource config:
tuner = tune.Tuner(
    resources={"cpu": 2, "gpu": 1}  # per trial by default

My goal is to split the computation of each trial across the GPUs (using DataParallel) while also running multiple trials in parallel, but I have not figured out the best way to do it. For example, if I run my tuner with resources={"cpu": 2, "gpu": 1}, each trial runs on its own GPU and gets no benefit from DataParallel. If I run with resources={"cpu": 2, "gpu": 8} instead, only one trial can run at a time, so only that trial benefits from the data parallelism.
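To make the tradeoff above concrete, here is a minimal sketch of the arithmetic. It assumes trials are scheduled purely by their per-trial resource request, and that each trial only sees the GPUs assigned to it (Ray sets CUDA_VISIBLE_DEVICES for the trial process, which is why DataParallel finds a single device with gpu=1). The helper max_concurrent_trials and the 16-CPU total are hypothetical and only for illustration; only the 8 GPUs and the 2 CPUs per trial come from the thread.

```python
def max_concurrent_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """Upper bound on how many trials fit on the machine at the same time."""
    limits = []
    if cpus_per_trial > 0:
        limits.append(int(total_cpus // cpus_per_trial))
    if gpus_per_trial > 0:
        limits.append(int(total_gpus // gpus_per_trial))
    # The scarcest resource decides how many trials can run together.
    return min(limits) if limits else 0


TOTAL_GPUS = 8   # from the thread
TOTAL_CPUS = 16  # assumed for illustration, not stated in the thread

for gpus in (1, 2, 4, 8):
    n = max_concurrent_trials(TOTAL_CPUS, TOTAL_GPUS, cpus_per_trial=2, gpus_per_trial=gpus)
    print(f"gpu={gpus}: up to {n} concurrent trial(s), DataParallel over {gpus} GPU(s) each")
```

Under these assumptions, intermediate requests such as gpu=2 or gpu=4 sit between the two extremes: several trials still run side by side, and each one can shard its batch over the GPUs it was given.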

What is your desired behavior here? How many GPUs do you want each trial to use?

There are 8 GPUs available, and I want to maximize parallelism without restricting each trial to the physical GPUs assigned to its worker. Since the GPUs are shared resources, I want to avoid hotspotting a single GPU.

The desired behavior is for each trial to be able to use all 8 available GPUs, so that the batch can be sharded across them, while more than one trial runs in parallel.

Currently this seems impossible, because setting "gpu": 1 hotspots a single GPU, while "gpu": 8 uses resources poorly (only one trial runs at a time, leaving a lot of resources unused).

Can you explain more about your use case? Typically you want to use data parallelism if you are bound by the memory of a single GPU. If you have N trials, it would not be beneficial for each of them to be running in a data parallel fashion across all 8 GPUs.

I am running in a shared environment where many people could be using the same GPUs, so training the whole batch on a single GPU could cause an OOM error.

To avoid the OOM error, I would like to shard the batch across GPUs.
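As a rough illustration of what sharding the batch buys, the sketch below estimates how many samples each GPU would receive. It assumes DataParallel splits the input along the batch dimension the same way torch.chunk does (chunks of ceil(batch_size / num_gpus), with a smaller or missing tail); shard_sizes is a hypothetical helper, not part of PyTorch or Ray.

```python
import math


def shard_sizes(batch_size, num_gpus):
    """Per-GPU sample counts under an assumed torch.chunk-style split."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(shard_sizes(256, 8))  # even split across all 8 GPUs
print(shard_sizes(30, 8))   # last GPU gets a smaller shard
print(shard_sizes(10, 8))   # small batches may not reach every GPU
```

Note that this only reduces the activation memory per GPU. The model itself is replicated on every GPU, and outputs are gathered on the first device, so that GPU typically still uses more memory than the others.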

As you mentioned, this is probably not a good solution performance-wise, but a workaround would be nice.