How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I searched around and didn't find a good answer on how to use Ray Tune on its own with a PyTorch DDP model.
I am able to wrap the model with nn.DataParallel and run it on a single node. For DDP, we usually use mp.spawn() or torchrun to launch multiple processes, each with a different rank; where should I put tune.run(), inside each process or outside? In general, I am not sure how the Ray Tune processes interact with the multiple processes launched by PyTorch. Has anyone successfully run hyperparameter tuning with Tune + DDP? Thanks.
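For context, the plain DDP setup I have working today (without Tune) looks roughly like the sketch below; build_model and the config dict are just placeholders for my own code:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_worker(rank, world_size, config):
    # Each spawned process joins the same process group with its own rank.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = build_model(config).to(rank)   # build_model is a placeholder for my model code
    model = DDP(model, device_ids=[rank])

    # ... training loop for this rank ...

    dist.destroy_process_group()

if __name__ == "__main__":
    config = {"lr": 1e-3}                  # placeholder hyperparameters
    world_size = torch.cuda.device_count()
    mp.spawn(ddp_worker, args=(world_size, config), nprocs=world_size)
```

My confusion is whether tune.run() should go inside ddp_worker() (so every rank calls it) or at the top level, and how Tune would then assign GPUs to each trial.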
I tried to use prepare_model() in a train function to create the DDP model, but world_size is None:
File "python3.11/site-packages/ray/train/torch/train_loop_utils.py", line 328, in prepare_model
if parallel_strategy and world_size > 1:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
My code flow looks like the following:
```python
from ray import tune
from ray.train.torch import prepare_model

def train_func(config):
    # ... build my_model ...
    prepare_model(my_model)
    # ... training loop ...

tune.run(
    tune.with_parameters(train_func, ...),  # extra kwargs elided
    resources_per_trial={"cpu": 2, "gpu": 4},
    config=param_space,
    num_samples=1,
)
```
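From what I can tell in the Ray Train docs, prepare_model() may only work inside a Ray Train worker, so I'm wondering whether the intended pattern is to wrap the train function in a TorchTrainer and tune that instead of calling tune.run() on the bare function. Something like the rough sketch below, based on the 2.x docs (names like ScalingConfig and train_loop_config may not match my Ray version exactly):

```python
from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    model = build_model(config)      # placeholder for my model-building code
    model = prepare_model(model)     # should wrap in DDP; world_size comes from the trainer
    # ... training loop, reporting metrics back to Tune ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": param_space},
    tune_config=tune.TuneConfig(num_samples=1),
)
results = tuner.fit()
```

Is that the right direction, or is there a way to use prepare_model() (or DDP in general) directly from tune.run() without a TorchTrainer?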