What is the right way of using Ray Tune with PyTorch DDP?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I searched around and didn't find a good answer on how to use Ray Tune on its own with a PyTorch DDP model.

I am able to use nn.DataParallel to wrap the model and run on a single node. For DDP, we usually use mp.spawn() or torchrun to launch multiple processes, each with a different rank. Where should I put tune.run(), inside each process or outside? In general, I am not sure how the Ray Tune processes interact with the multiple processes launched by PyTorch. Has anyone successfully run hyperparameter tuning with Tune + DDP? Thanks.

I tried to use prepare_model() in a training function to create the DDP model, but world_size is None:

  File "python3.11/site-packages/ray/train/torch/train_loop_utils.py", line 328, in prepare_model
    if parallel_strategy and world_size > 1:
TypeError: '>' not supported between instances of 'NoneType' and 'int'

My code flow looks like the following:

def train_func(config):
    .... ...
    prepare_model(my_model)
    ... ...

tune.run(
    tune.with_parameters(train_func, ... ...),
    resources_per_trial={"cpu": 2, "gpu": 4},
    config=param_space,
    num_samples=1,
)

Hi @veydan, the best way is to use TorchTrainer + Tuner. You can refer to this example for more details: Using PyTorch Lightning with Tune — Ray 3.0.0.dev0
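
A minimal sketch of that TorchTrainer + Tuner pattern, written against the Ray 2.x API (the model, metric value, and hyperparameters below are placeholders for illustration, not from the original post):

import torch.nn as nn
import torch.optim as optim
from ray import train, tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


def train_loop(config):
    # Placeholder model just for illustration.
    model = nn.Linear(10, 1)
    # prepare_model() wraps the model in DDP and moves it to the right
    # device. It works here because TorchTrainer has already set up the
    # distributed worker group (rank, world_size, process group).
    model = prepare_model(model)

    optimizer = optim.SGD(model.parameters(), lr=config["lr"])
    for epoch in range(config["epochs"]):
        # ... real training loop over a prepared DataLoader goes here ...
        train.report({"loss": 0.0})  # placeholder metric reported to Tune


# Each trial launches 4 DDP workers, one GPU per worker.
trainer = TorchTrainer(
    train_loop,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "lr": tune.loguniform(1e-4, 1e-1),
            "epochs": 2,
        }
    },
    tune_config=tune.TuneConfig(num_samples=1),
)
results = tuner.fit()

With this setup the per-trial resources come from ScalingConfig rather than resources_per_trial, and prepare_model() runs inside a worker group that TorchTrainer has already initialized, so world_size is set correctly instead of being None.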