Hi, I am using PyTorch Lightning (1.5.10), Ray Lightning (0.2), and Ray Tune (1.10.0) to distribute training and tuning. From the documentation it is unclear to me how to do resource allocation with RayPlugin.
Here (ray_lightning/ray_ddp.py at 65f497a3c8bedb2f24bf04a5dbf0ea62b5bcb4d6 · ray-project/ray_lightning · GitHub) the docstring says to specify GPUs in the PyTorch Lightning Trainer with a value > 0,
but here (ray_lightning/ray_ddp.py at 65f497a3c8bedb2f24bf04a5dbf0ea62b5bcb4d6 · ray-project/ray_lightning · GitHub) the example says NOT to specify resources in the Trainer.
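To make the ambiguity concrete, here is a minimal sketch of the two readings as I understand them (the Trainer arguments are placeholders, not my actual code), using RayPlugin from ray_lightning 0.2:

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

plugin = RayPlugin(num_workers=8, num_cpus_per_worker=1, use_gpu=True)

# Reading 1 (docstring): also set gpus > 0 on the Trainer.
trainer_a = pl.Trainer(max_epochs=1, gpus=1, plugins=[plugin])

# Reading 2 (example): leave gpus/num_processes unset and let the
# plugin handle resource allocation.
trainer_b = pl.Trainer(max_epochs=1, plugins=[plugin])
```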
Which one is right? I am able to run tuning experiments in parallel, but I am unable to distribute training. I have 5 worker nodes with 8 GPUs each.
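For context, here is a simplified, self-contained sketch of roughly how I am launching the Tune run. The toy model, dataloaders, and search space are placeholders rather than my real code, and the get_tune_resources / TuneReportCallback usage follows my reading of the ray_lightning 0.2 docs:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray import tune
from ray_lightning import RayPlugin
from ray_lightning.tune import TuneReportCallback, get_tune_resources

NUM_WORKERS = 8  # one trial should span all 8 GPUs of a single node


class ToyModel(pl.LightningModule):
    # Placeholder LightningModule standing in for my actual model.
    def __init__(self, lr):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)


def make_loader():
    # Placeholder random data instead of my real datasets.
    data = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    return DataLoader(data, batch_size=32)


def train_fn(config):
    model = ToyModel(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=2,
        # Following the example, no gpus/num_processes set on the Trainer here.
        plugins=[RayPlugin(num_workers=NUM_WORKERS, use_gpu=True)],
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model, make_loader(), make_loader())


analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder search space
    num_samples=5,  # ideally one trial per node
    resources_per_trial=get_tune_resources(num_workers=NUM_WORKERS, use_gpu=True),
)
```

With a setup like this, the trials themselves run in parallel, but the training inside each trial does not appear to get distributed across the node's 8 GPUs.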
How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.