Ray Tune does not work properly with DDP PyTorch Lightning

I found that Ray Tune does not work properly with DDP PyTorch Lightning. My specific situation is as follows.

Ray 1.2.0.dev0, PyTorch 1.7, PyTorch Lightning 1.1.1.
I have one machine with 80 CPU cores and 2 GPUs.

I want to use Ray Tune to run 1 trial, which requires 10 CPU cores and 2 GPUs, using PyTorch Lightning's DistributedDataParallel (DDP). I use “DistributedTrainableCreator” this way:

DistributedTrainableCreator(train_model_with_parameters, num_workers=1, num_cpus_per_worker=10, num_gpus_per_worker=2)
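
For context, the surrounding code looks roughly like this (a simplified sketch: the body of train_model_with_parameters and the search space are placeholders):

```python
from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator

# train_model_with_parameters(config, checkpoint_dir=None) contains my
# PyTorch Lightning training loop; its body is omitted here.
trainable = DistributedTrainableCreator(
    train_model_with_parameters,
    num_workers=1,
    num_cpus_per_worker=10,
    num_gpus_per_worker=2,
)

analysis = tune.run(
    trainable,
    num_samples=1,                               # a single trial
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder search space
)
```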

But I got this warning, and soon after, the program ended unexpectedly. I observed that only GPU 0 was working; GPU 1 was never started.

May I ask what causes this problem and how to solve it? Thank you very much for your help!

Hi @yongw, we recently received a number of messages concerning multi-GPU training with PyTorch Lightning. We’re working on resolving this issue as soon as possible. Cc @amogkam, who is working on this.

Thank you very much.

Hi @yongw, we just finished implementing a Ray backend for distributed PyTorch Lightning training here: GitHub - ray-project/ray_lightning_accelerators: Pytorch Lightning Distributed Accelerators using Ray.

The package introduces two new PyTorch Lightning accelerators for quick and easy distributed training on Ray.

It also integrates with Tune and should resolve your issue: you can use Tune to run multiple trials in parallel, and each trial can itself be distributed across any number of CPUs or GPUs.
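
Roughly, usage looks like this (a minimal sketch; MyLightningModule stands in for your own LightningModule, and the exact import path and argument names may differ slightly between versions, so please check the README in the repo):

```python
import pytorch_lightning as pl
from ray_lightning import RayAccelerator  # import path may differ by version; see the README

model = MyLightningModule()  # placeholder for your own LightningModule
trainer = pl.Trainer(
    max_epochs=10,
    accelerator=RayAccelerator(
        num_workers=2,  # number of distributed training workers (DDP processes)
        use_gpu=True,   # reserve 1 GPU for each worker
    ),
)
trainer.fit(model)
```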

Please check it out, and let us know how it goes!

Thanks for your reply. I’ve tried it and it works fine.
I have one question about resource allocation.
I have one machine with 80 CPU cores and 2 GPUs. Do I have to set “num_workers” to 1? I don’t understand what “num_workers” means.

Hi @amogkam. For example, if I have one machine with 80 CPU cores and 8 GPUs, and I want to use Ray Tune to run 3 trials, where every trial needs 10 CPU cores and 2 GPUs with DDP:

How should I set “cpu”, “gpu”, “extra_cpu”, and “extra_gpu” in ray.tune.run?

How should I set “num_workers”, “cpus_per_worker”, and “use_gpu” in RayAccelerator?

I’ve looked at this ray_ddp_tune.py example, but it may use too few resources for my case; see my attempt below.
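
To make the question concrete, this is roughly what I have been trying for the 3-trial case (my own guess at the mapping, with MyLightningModule standing in for my model, so please correct me if the numbers or names are wrong):

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayAccelerator  # name as used in this thread; see the README

def train_fn(config):
    model = MyLightningModule(config)  # placeholder for my LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        accelerator=RayAccelerator(
            num_workers=2,           # 2 DDP processes per trial, one per GPU
            num_cpus_per_worker=5,   # 2 workers x 5 CPUs = 10 CPUs per trial
            use_gpu=True,            # each worker reserves 1 GPU
        ),
    )
    trainer.fit(model)

tune.run(
    train_fn,
    num_samples=3,  # 3 trials: 3 x (10 CPUs + 2 GPUs) fits in 80 CPUs / 8 GPUs
    # my guess: the trial process only drives the Ray workers, so "cpu"/"gpu"
    # stay small and the worker resources go into "extra_cpu"/"extra_gpu"
    resources_per_trial={"cpu": 1, "gpu": 0, "extra_cpu": 10, "extra_gpu": 2},
)
```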

Looking forward to your reply.

There is a small mistake in the README: “cpus_per_worker” should be changed to “num_cpus_per_worker”.

Hi, this is what I have been trying to achieve on a multi-node, multi-GPU machine. It doesn’t seem to work as expected.

Here (https://github.com/ray-project/ray_lightning/blob/65f497a3c8bedb2f24bf04a5dbf0ea62b5bcb4d6/ray_lightning/ray_ddp.py#L84) it says to set GPUs in the PyTorch Lightning Trainer to a value > 0,

and here (https://github.com/ray-project/ray_lightning/blob/65f497a3c8bedb2f24bf04a5dbf0ea62b5bcb4d6/ray_lightning/ray_ddp.py#L107) the example says NOT to specify resources in the Trainer.

I have 5 worker nodes, each with 8 GPUs and 16 CPUs. I’m not sure how to allocate resources within the RayPlugin and the get_tune_resources function.
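
Roughly, the relevant part of what I tried looks like this (simplified; MyLightningModule is a stand-in for my model, the import paths are my understanding of the repo, and I can post the full script if that helps):

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayPlugin
from ray_lightning.tune import get_tune_resources  # import path as I understand it from the repo

num_workers = 8          # one DDP worker per GPU, so one full node per trial
num_cpus_per_worker = 2  # 8 workers x 2 CPUs = 16 CPUs, i.e. all CPUs of a node

def train_fn(config):
    model = MyLightningModule(config)  # stand-in for my LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        # unclear whether gpus=... should also be set here, given the two docstrings above
        plugins=[RayPlugin(
            num_workers=num_workers,
            num_cpus_per_worker=num_cpus_per_worker,
            use_gpu=True,
        )],
    )
    trainer.fit(model)

tune.run(
    train_fn,
    num_samples=5,  # ideally one trial per node
    resources_per_trial=get_tune_resources(
        num_workers=num_workers,
        num_cpus_per_worker=num_cpus_per_worker,
        use_gpu=True,
    ),
)
```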

Hi @Manoj_Kumar_Dobbali
Can you share your script? I can help take a look!