I am using Ray Train's Trainer in a typical distributed training setup. My problem is that the code gets stuck on the line student = prepare_model(student), i.e. right at the print "Wrapping provided model in DDP." I've let the script run for several minutes without any visible progress.
Below is a minimal example, without even a DataLoader. It runs on Windows, hence the gloo backend; the same thing happens on Google Colab with the regular nccl backend, i.e. Trainer(backend="torch"). For now my machine only has 1 GPU, but as far as I know that should not be a problem for Ray, since it isn't for regular DDP.
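For reference, the Colab variant only differs in how the Trainer is constructed, roughly like this (same worker count and GPU fraction as in the example below):

trainer = Trainer(backend="torch", num_workers=2, use_gpu=True,
                  resources_per_worker={"GPU": 0.4})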
Minimal working example:
import ray
import ray.train as rt
import torch.nn as nn
from ray.train.torch import TorchConfig, prepare_model
from ray.train import Trainer


def train_distributed():
    student = Mlp(10, 20, 10)
    # This is the line that hangs; "Wrapping provided model in DDP." is the last log output.
    student = prepare_model(student)
    for epoch in range(2):
        for i in range(10):
            rt.report()


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        return x


if __name__ == '__main__':
    ray.init()
    trainer = Trainer(backend=TorchConfig(backend="gloo"), num_workers=2, use_gpu=True,
                      resources_per_worker={"GPU": 0.4})
    trainer.start()
    trainer.run(train_distributed)
    trainer.shutdown()
Print:
2022-03-04 17:17:05,219 INFO trainer.py:190 -- Trainer logs will be logged in: C:\Users\kaise\ray_results\train_2022-03-04_17-17-05
pid=25556) 2022-03-04 17:17:07,158 INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=2]
pid=22548) 2022-03-04 17:17:07,148 INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=2]
2022-03-04 17:17:08,234 INFO trainer.py:196 -- Run results will be logged in: C:\Users\kaise\ray_results\train_2022-03-04_17-17-05\run_001
pid=25556) 2022-03-04 17:17:09,484 INFO torch.py:239 -- Moving model to device: cuda:0
pid=25556) 2022-03-04 17:17:09,594 INFO torch.py:242 -- Wrapping provided model in DDP.
The same thing happens when I include a dataset and DataLoader and actually use them and the model inside the training function.
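That version of the training function looked roughly like this (a sketch with a dummy TensorDataset rather than the exact code I ran, reusing the Mlp class and imports from the example above):

import torch
from torch.utils.data import DataLoader, TensorDataset
from ray.train.torch import prepare_data_loader

def train_distributed_with_data():
    # Dummy data just so the loader and model actually get used.
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 10))
    loader = prepare_data_loader(DataLoader(dataset, batch_size=8))

    student = Mlp(10, 20, 10)
    student = prepare_model(student)  # still hangs here

    optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for epoch in range(2):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(student(x), y)
            loss.backward()
            optimizer.step()
        rt.report(loss=loss.item())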
I've also tried the Callback functionality together with train.report(), in case prints from the workers were being suppressed, but that didn't change anything. Also, CPU and GPU (when used) stay mostly idle while the script hangs, so I'm fairly sure nothing is actually happening in the background and the script really is stuck.
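The callback was along these lines (a sketch based on the TrainingCallback example in the Ray 1.10 docs; handle_result receives one result dict per worker for each train.report() call):

from ray.train.callbacks import TrainingCallback

class PrintCallback(TrainingCallback):
    def handle_result(self, results, **info):
        # results is a list with one dict per worker per train.report() call
        print(results)

trainer.run(train_distributed, callbacks=[PrintCallback()])

It never printed anything, which is consistent with the workers never getting past prepare_model.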
The script is so barebones that I have no idea what the problem could be, especially because the behavior is identical on my laptop and on Google Colab. It's probably user error, but I'm out of ideas and have read the Trainer docs and guides 20 times by now.
System:
Windows 10
torch 1.10.2+cu113
torchaudio 0.10.2+cu113
torchvision 0.11.3+cu113
ray 1.10.0
Python 3.8.5 (conda installation)
Google Colab
torch 1.10.0+cu111
torchaudio 0.10.0+cu111
torchvision 0.11.1+cu111
ray 1.10.0