Ray + torch.distributed/DDP resource management


I have a PyTorch training script that handles all nodes/processes through torch.distributed and torchrun.

I want to integrate Ray for hyperparameter tuning while staying with torch.distributed to manage the training on multiple GPUs. Ray should act as a “wrapper” around that script, starting the training and leaving the resource management to the script itself. In that scenario, Ray could run on the CPU and just receive metric values from the script running on multiple GPUs.

I presume this conflicts with the resources set/determined in tune.run(). Is this possible/viable, and if so, is there an example somewhere? Or is Ray too “high level”, and would another framework fit my use case better? Most scripts I have found leave the resource management to Ray.

Edit: To clarify, I have Optuna running (but I would prefer to use Ray Tune). Optuna's DDP example works side by side with DDP: optuna-examples/pytorch_distributed_simple.py at main · optuna/optuna-examples · GitHub. I am looking for something comparable with Ray Tune.
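To make the pattern concrete, here is roughly what I have in mind: a Tune trainable that only occupies a CPU, shells out to my existing torchrun launch, and reports back whatever metric the script emits. Everything here is hypothetical (the script name `train.py`, the `--lr` flag, and the final-JSON-line convention are made up for illustration):

```python
import json
import subprocess


def launch_torchrun(config):
    """Hypothetical Tune trainable: the trial itself needs no GPUs;
    the launched script manages its own GPUs via torch.distributed."""
    cmd = [
        "torchrun", "--nproc_per_node=4", "train.py",
        "--lr", str(config["lr"]),  # hyperparameter passed through to the script
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Assumes train.py prints a final JSON line such as {"loss": 0.12};
    # that dict would then be handed to Tune's reporting API.
    metrics = json.loads(proc.stdout.strip().splitlines()[-1])
    return metrics
```

The open question is exactly what I asked above: since Tune would only see the CPU trainable, it would not reserve the GPUs the script actually uses, so concurrent trials could oversubscribe them.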

Hey @Eisbaer- I would recommend using the Ray TorchTrainer to handle distributed training instead of torchrun. TorchTrainer takes care of all the torch.distributed setup for you, so you only have to write your model code.

TorchTrainer has a built-in integration with Tune, so you can do distributed hyperparameter tuning plus distributed training with Ray handling the resource management for both.

You can find more here: Deep Learning User Guide — Ray 2.0.0. Let me know if this works for you.