Ray + torch.distributed/DDP resource management


I have a PyTorch training script that handles all nodes/processes through torch.distributed and torchrun.

I want to integrate Ray for hyperparameter tuning while staying with torch.distributed to manage the training on multiple GPUs. Ray should act as a “wrapper” around that script, starting the training and leaving the resource management to the script itself. In that scenario, Ray could run on the CPU and just receive metric values from the script running on multiple GPUs.

I presume this conflicts with the resources set/determined in tune.run(). Is this possible/viable, and if so, is there an example somewhere? Or is Ray too “high level”, and would another framework fit my use case better? Most scripts I have found leave the resource management to Ray.

Edit: To clarify, I have Optuna running (but I would prefer to use Ray Tune). Optuna's DDP example works side by side with DDP: optuna-examples/pytorch_distributed_simple.py at main · optuna/optuna-examples · GitHub. I am looking for something comparable with Ray Tune.
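To make the pattern concrete, here is roughly what I have in mind: a Tune trainable that only occupies a CPU, shells out to my existing torchrun launch, and reports back whatever metric the script emits. Everything here is hypothetical (the script name `train.py`, the `--lr` flag, and the final-JSON-line convention are made up for illustration):

```python
import json
import subprocess


def launch_torchrun(config):
    """Hypothetical Tune trainable: the trial itself needs no GPUs;
    the launched script manages its own GPUs via torch.distributed."""
    cmd = [
        "torchrun", "--nproc_per_node=4", "train.py",
        "--lr", str(config["lr"]),  # hyperparameter passed through to the script
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Assumes train.py prints a final JSON line such as {"loss": 0.12};
    # that dict would then be handed to Tune's reporting API.
    metrics = json.loads(proc.stdout.strip().splitlines()[-1])
    return metrics
```

The open question is exactly what I asked above: since Tune would only see the CPU trainable, it would not reserve the GPUs the script actually uses, so concurrent trials could oversubscribe them.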

Hey @Eisbaer- I would recommend using the Ray TorchTrainer to handle distributed training instead of torchrun. TorchTrainer takes care of all the torch.distributed setup for you, so you only have to write your model code.

TorchTrainer has a built-in integration with Tune, so you can do distributed hyperparameter tuning plus distributed training with Ray handling the resource management for both.

You can find more here: Deep Learning User Guide — Ray 2.0.0. Let me know if this works for you.