Ray + torch.distributed/DDP resource management

Hello,

I have a PyTorch training script that handles all nodes/processes through torch.distributed and torchrun.

I want to integrate Ray for hyperparameter tuning while staying with torch.distributed to manage the training across multiple GPUs. Ray would act as a “wrapper” around that script, starting the training and leaving the resource management to the script itself. In that scenario Ray could run on the CPU and just receive metric values from the script running on the GPUs.
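Roughly, I am imagining something like the sketch below. It is just an illustration of the idea, not working code: `train.py`, the `--metrics-out` flag, and the `loss` metric are placeholders for my own script, and each trial only reserves CPUs while torchrun manages the GPUs on its own.

```python
# Sketch: a Tune trial that shells out to my existing torchrun-based script
# and only passes the final metric back to Tune.
import json
import subprocess
import tempfile

from ray import tune


def run_torchrun_trial(config):
    with tempfile.NamedTemporaryFile(suffix=".json") as metrics_file:
        subprocess.run(
            [
                "torchrun",
                "--nproc_per_node=4",            # the script manages its own GPUs
                "train.py",                      # my existing DDP training script
                f"--lr={config['lr']}",
                f"--metrics-out={metrics_file.name}",
            ],
            check=True,
        )
        with open(metrics_file.name) as f:
            metrics = json.load(f)
    tune.report(loss=metrics["loss"])  # hand the result back to Tune


analysis = tune.run(
    run_torchrun_trial,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial={"cpu": 1},  # Ray itself only “sees” one CPU per trial
    num_samples=8,
)
```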

I presume this conflicts with the resources set/determined in tune.run(). Is this possible/viable, and if so, is there an example somewhere? Or is Ray too “high level” and another framework would be a better fit for my use case? Most examples I have found leave the resource management to Ray.

Edit: To clarify, I already have Optuna running (but I would prefer to use Ray Tune). Optuna's DDP example (pytorch_distributed_simple.py in the optuna-examples GitHub repo) works side by side with DDP. I am looking for something comparable with Ray Tune.

Hey @Eisbaer- I would recommend using the Ray TorchTrainer to handle distributed training instead of torchrun. TorchTrainer will take care of all the torch.distributed setup for you so you only have to write your model code.

TorchTrainer has a built-in integration with Tune, so you can do distributed hyperparameter tuning + distributed training with Ray handling the resource management for both.
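A rough sketch of what that looks like with the Ray 2.x Train/Tune APIs (the toy model, metric names, and hyperparameters are placeholders, and the exact import paths and reporting call have moved around between Ray versions, so check the guide for your release):

```python
import torch
import torch.nn as nn

from ray import tune
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model
from ray.tune import Tuner, TuneConfig


def train_loop_per_worker(config):
    # Ray Train has already initialized torch.distributed for this worker.
    model = prepare_model(nn.Linear(10, 1))  # wraps in DDP, moves to the right device
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()
    device = get_device()

    for epoch in range(config["epochs"]):
        x, y = torch.randn(32, 10).to(device), torch.randn(32, 1).to(device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Report metrics back to Tune at the end of each epoch.
        session.report({"loss": loss.item(), "epoch": epoch})


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # 2 DDP workers, 1 GPU each
)

tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1), "epochs": 2},
    },
    tune_config=TuneConfig(num_samples=4, metric="loss", mode="min"),
)
results = tuner.fit()
```

Each Tune trial launches its own set of distributed training workers according to the ScalingConfig, so you don't have to reconcile torchrun's process management with Tune's resource bookkeeping yourself.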

You can find more here: Deep Learning User Guide — Ray 2.0.0. Let me know if this works for you.