What is the right way of using Ray Tune with PyTorch DDP?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I searched around and didn't find a good answer on how to use Ray Tune on its own with a PyTorch DDP model.

I am able to use nn.DataParallel to wrap the model and run on a single node. For DDP, we usually use mp.spawn() or torchrun to launch multiple processes, each with a different rank. Where should I put tune.run(), inside each process or outside? In general, I am not sure how the Ray Tune processes interact with the multiple processes launched by PyTorch. Has anyone successfully run hyperparameter tuning with Tune + DDP? Thanks.

I tried to use prepare_model() in a training function to create the DDP model, but world_size is None:

  File "python3.11/site-packages/ray/train/torch/train_loop_utils.py", line 328, in prepare_model
    if parallel_strategy and world_size > 1:
TypeError: '>' not supported between instances of 'NoneType' and 'int'

My code flow looks like the following:

def train_func(config):
    .... ...
    prepare_model(my_model)
    ... ...

tune.run(
    tune.with_parameters(train_func, ... ...),
    resources_per_trial={"cpu": 2, "gpu": 4},
    config=param_space,
    num_samples=1,
)

Hi @veydan, the best way is to use TorchTrainer + Tuner. You can refer to this example for more details: Using PyTorch Lightning with Tune — Ray 3.0.0.dev0
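
A minimal sketch of that TorchTrainer + Tuner pattern, written against the Ray 2.x API (the model, metric value, and hyperparameters below are placeholders for illustration, not from the original post):

import torch.nn as nn
import torch.optim as optim
from ray import train, tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


def train_loop(config):
    # Placeholder model just for illustration.
    model = nn.Linear(10, 1)
    # prepare_model() wraps the model in DDP and moves it to the right
    # device. It works here because TorchTrainer has already set up the
    # distributed worker group (rank, world_size, process group).
    model = prepare_model(model)

    optimizer = optim.SGD(model.parameters(), lr=config["lr"])
    for epoch in range(config["epochs"]):
        # ... real training loop over a prepared DataLoader goes here ...
        train.report({"loss": 0.0})  # placeholder metric reported to Tune


# Each trial launches 4 DDP workers, one GPU per worker.
trainer = TorchTrainer(
    train_loop,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "lr": tune.loguniform(1e-4, 1e-1),
            "epochs": 2,
        }
    },
    tune_config=tune.TuneConfig(num_samples=1),
)
results = tuner.fit()

With this setup the per-trial resources come from ScalingConfig rather than resources_per_trial, and prepare_model() runs inside a worker group that TorchTrainer has already initialized, so world_size is set correctly instead of being None.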