I found that Ray Tune does not work properly with PyTorch Lightning's DDP. My specific situation is as follows.
Ray 1.2.0.dev0, PyTorch 1.7, PyTorch Lightning 1.1.1.
I have one machine with 80 CPU cores and 2 GPUs.
I want to use Ray Tune to run 1 trial, which requires 10 CPU cores and 2 GPUs, using PyTorch Lightning's DistributedDataParallel. I use “DistributedTrainableCreator” like this.
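Roughly like this (a sketch of my setup; the argument names are how I understand the Ray 1.2 API, so they may differ in other versions):

```python
from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator

def train_ddp(config, checkpoint_dir=None):
    # Build the LightningModule and Trainer here and call trainer.fit().
    ...

# 2 DDP workers (one per GPU) with 5 CPU cores each:
# 10 CPU cores and 2 GPUs in total for the single trial.
trainable = DistributedTrainableCreator(
    train_ddp,
    num_workers=2,
    num_cpus_per_worker=5,
    use_gpu=True,
)

tune.run(trainable, num_samples=1)
```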
Hi @yongw, we have recently received a number of messages concerning multi-GPU training with PyTorch Lightning. We're working on resolving this issue as soon as possible. Cc @amogkam, who is working on this.
The package introduces two new PyTorch Lightning accelerators for quick and easy distributed training on Ray.
It also integrates with Tune and should resolve your issue. Now you can use Tune to run multiple trials in parallel, and each trial can itself be distributed with any number of CPUs or GPUs.
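Roughly, usage looks like this (a minimal sketch; “MyLightningModule” is a placeholder for your own module, and the exact import path may vary between versions of the package):

```python
import pytorch_lightning as pl
from ray_lightning import RayAccelerator  # import path may differ across versions

# Placeholder: your own LightningModule goes here.
model = MyLightningModule()

trainer = pl.Trainer(
    max_epochs=10,
    # 2 Ray workers, each reserving 5 CPU cores and 1 GPU -> DDP across 2 GPUs.
    accelerator=RayAccelerator(num_workers=2, cpus_per_worker=5, use_gpu=True),
)
trainer.fit(model)
```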
Thanks for your reply. I’ve tried it and it works fine.
I have one question about resource allocation.
I have one machine with 80 CPU cores and 2 GPUs. Do I have to set “num_workers” = 1? I don't understand what “num_workers” means.
Hi @amogkam. For example, suppose I have one machine with 80 CPU cores and 8 GPUs, and I want to use Ray Tune to run 3 trials, where every trial needs 10 CPU cores and 2 GPUs and uses DDP.
How should I set “cpu”, “gpu”, “extra_cpu”, and “extra_gpu” in ray.tune.run?
How should I set “num_workers”, “cpus_per_worker”, and “use_gpu” in RayAccelerator?
I've looked at the ray_ddp_tune.py example, but maybe it uses too few resources for my case.
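To make the question concrete, here is my current guess (a sketch only; I am treating “extra_cpu”/“extra_gpu” as reservations for the DDP worker processes, and “train_fn” stands in for the training function that builds the Trainer with the RayAccelerator):

```python
from ray import tune

# Per trial: 1 CPU core for the trial driver itself, plus 2 DDP workers
# with 5 CPU cores and 1 GPU each, reserved via extra_cpu/extra_gpu.
tune.run(
    train_fn,  # builds a Trainer with RayAccelerator(num_workers=2, cpus_per_worker=5, use_gpu=True)
    num_samples=3,
    resources_per_trial={
        "cpu": 1,
        "gpu": 0,
        "extra_cpu": 10,  # 2 workers * 5 CPU cores
        "extra_gpu": 2,   # 2 workers * 1 GPU
    },
)
# In total: 3 trials * (1 + 10) = 33 of the 80 CPU cores, and 3 * 2 = 6 of the 8 GPUs.
```

Is this the right way to think about it?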