Ray Tune does not work properly with DDP PyTorch Lightning

I found that Ray Tune does not work properly with DDP PyTorch Lightning. My specific situation is as follows.

Ray 1.2.0.dev0, PyTorch 1.7, PyTorch Lightning 1.1.1.
I have one machine with 80 CPU cores and 2 GPUs.

I want to use Ray Tune to run 1 trial, which requires 10 CPU cores and 2 GPUs, using PyTorch Lightning's DistributedDataParallel (DDP). I use “DistributedTrainableCreator” this way:

DistributedTrainableCreator(train_model_with_parameters, num_workers=1, num_cpus_per_worker=10, num_gpus_per_worker=2)
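
For context, the surrounding code looks roughly like this (a simplified sketch: the body of train_model_with_parameters and the search space are placeholders):

```python
from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator

# train_model_with_parameters(config, checkpoint_dir=None) contains my
# PyTorch Lightning training loop; its body is omitted here.
trainable = DistributedTrainableCreator(
    train_model_with_parameters,
    num_workers=1,
    num_cpus_per_worker=10,
    num_gpus_per_worker=2,
)

analysis = tune.run(
    trainable,
    num_samples=1,                               # a single trial
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder search space
)
```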

But I got this warning, and soon after, the program ended unexpectedly. I observed that only GPU 0 was working; GPU 1 was never started.

May I ask what causes this problem and how to solve it? Thank you very much for your help!

Hi @yongw, we recently received a number of messages concerning multi-GPU training with PyTorch Lightning. We’re working on resolving this issue as soon as possible. Cc @amogkam, who is working on this.

Thank you very much.

Hi @yongw, we just finished implementing a Ray backend for distributed PyTorch Lightning training here: GitHub - ray-project/ray_lightning_accelerators: Pytorch Lightning Distributed Accelerators using Ray.

The package introduces two new PyTorch Lightning accelerators for quick and easy distributed training on Ray.

It also integrates with Tune and should resolve your issue: you can use Tune to run multiple trials in parallel, and each trial can itself be distributed across any number of CPUs or GPUs.
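
Roughly, usage looks like this (a minimal sketch; MyLightningModule stands in for your own LightningModule, and the exact import path and argument names may differ slightly between versions, so please check the README in the repo):

```python
import pytorch_lightning as pl
from ray_lightning import RayAccelerator  # import path may differ by version; see the README

model = MyLightningModule()  # placeholder for your own LightningModule
trainer = pl.Trainer(
    max_epochs=10,
    accelerator=RayAccelerator(
        num_workers=2,  # number of distributed training workers (DDP processes)
        use_gpu=True,   # reserve 1 GPU for each worker
    ),
)
trainer.fit(model)
```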

Please check it out, and let us know how it goes!

Thanks for your reply. I’ve tried it and it works fine.
I have one question about resource allocation.
I have one machine with 80 CPU cores and 2 GPUs. Do I have to set “num_workers” to 1? I don’t understand what “num_workers” means.

Hi @amogkam. For example, if I have one machine with 80 CPU cores and 8 GPUs, and I want to use Ray Tune to run 3 trials, where every trial needs 10 CPU cores and 2 GPUs with DDP:

How should I set “cpu”, “gpu”, “extra_cpu”, and “extra_gpu” in ray.tune.run?

How should I set “num_workers”, “cpus_per_worker”, and “use_gpu” in RayAccelerator?

I’ve looked at this ray_ddp_tune.py example, but it may use too few resources for my case; see my attempt below.
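
To make the question concrete, this is roughly what I have been trying for the 3-trial case (my own guess at the mapping, with MyLightningModule standing in for my model, so please correct me if the numbers or names are wrong):

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayAccelerator  # name as used in this thread; see the README

def train_fn(config):
    model = MyLightningModule(config)  # placeholder for my LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        accelerator=RayAccelerator(
            num_workers=2,           # 2 DDP processes per trial, one per GPU
            num_cpus_per_worker=5,   # 2 workers x 5 CPUs = 10 CPUs per trial
            use_gpu=True,            # each worker reserves 1 GPU
        ),
    )
    trainer.fit(model)

tune.run(
    train_fn,
    num_samples=3,  # 3 trials: 3 x (10 CPUs + 2 GPUs) fits in 80 CPUs / 8 GPUs
    # my guess: the trial process only drives the Ray workers, so "cpu"/"gpu"
    # stay small and the worker resources go into "extra_cpu"/"extra_gpu"
    resources_per_trial={"cpu": 1, "gpu": 0, "extra_cpu": 10, "extra_gpu": 2},
)
```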

Looking forward to your reply.

There is a small mistake in the README: “cpus_per_worker” should be changed to “num_cpus_per_worker”.

Hi, this is what I have been trying to achieve on a multi-node, multi-GPU machine. It doesn’t seem to work as expected.

Here (https://github.com/ray-project/ray_lightning/blob/65f497a3c8bedb2f24bf04a5dbf0ea62b5bcb4d6/ray_lightning/ray_ddp.py#L84) it says to set GPUs in the PyTorch Lightning Trainer to a value > 0,

and here (https://github.com/ray-project/ray_lightning/blob/65f497a3c8bedb2f24bf04a5dbf0ea62b5bcb4d6/ray_lightning/ray_ddp.py#L107) the example says NOT to specify resources in the Trainer.

I have 5 worker nodes, each with 8 GPUs and 16 CPUs. I’m not sure how to allocate resources within the RayPlugin and the get_tune_resources function.
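
Roughly, the relevant part of what I tried looks like this (simplified; MyLightningModule is a stand-in for my model, the import paths are my understanding of the repo, and I can post the full script if that helps):

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayPlugin
from ray_lightning.tune import get_tune_resources  # import path as I understand it from the repo

num_workers = 8          # one DDP worker per GPU, so one full node per trial
num_cpus_per_worker = 2  # 8 workers x 2 CPUs = 16 CPUs, i.e. all CPUs of a node

def train_fn(config):
    model = MyLightningModule(config)  # stand-in for my LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        # unclear whether gpus=... should also be set here, given the two docstrings above
        plugins=[RayPlugin(
            num_workers=num_workers,
            num_cpus_per_worker=num_cpus_per_worker,
            use_gpu=True,
        )],
    )
    trainer.fit(model)

tune.run(
    train_fn,
    num_samples=5,  # ideally one trial per node
    resources_per_trial=get_tune_resources(
        num_workers=num_workers,
        num_cpus_per_worker=num_cpus_per_worker,
        use_gpu=True,
    ),
)
```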

Hi @Manoj_Kumar_Dobbali
Can you share your script? I can help take a look!