Ray Trainer prepare_model gets stuck

I am using Ray Trainer in a typical distributed training setup. My problem is that my code hangs on the line “student = prepare_model(student)”, right after the log message “Wrapping provided model in DDP.”. I’ve let the script run for several minutes without any visible progress.

Below is a minimal example, even without a DataLoader. It runs on Windows, hence the gloo backend; the same happens on Google Colab with the regular nccl backend, i.e. Trainer(backend="torch"). For now, my system only has 1 GPU, but I’m not aware that this would be a problem for Ray, since it is not for regular DDP.

Minimal working example:

import ray
import ray.train as rt
import torch.nn as nn
from ray.train.torch import TorchConfig, prepare_model
from ray.train import Trainer

def train_distributed():
    student = Mlp(10, 20, 10)
    student = prepare_model(student)

    for epoch in range(2):
        for i in range(10):
            pass  # training step omitted; the script hangs before reaching this loop

class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()  # required before submodules can be registered
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        return x

if __name__ == '__main__':
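    trainer = Trainer(backend=TorchConfig(backend="gloo"), num_workers=2, use_gpu=True,
                      resources_per_worker={"GPU": 0.4})
    # Start the workers, run the training function on each, then clean up.
    trainer.start()
    trainer.run(train_distributed)
    trainer.shutdown()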


2022-03-04 17:17:05,219	INFO trainer.py:190 -- Trainer logs will be logged in: C:\Users\kaise\ray_results\train_2022-03-04_17-17-05
 pid=25556) 2022-03-04 17:17:07,158 INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=2]
 pid=22548) 2022-03-04 17:17:07,148 INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=2]
2022-03-04 17:17:08,234	INFO trainer.py:196 -- Run results will be logged in: C:\Users\kaise\ray_results\train_2022-03-04_17-17-05\run_001
 pid=25556) 2022-03-04 17:17:09,484 INFO torch.py:239 -- Moving model to device: cuda:0
 pid=25556) 2022-03-04 17:17:09,594 INFO torch.py:242 -- Wrapping provided model in DDP.

The same also happens when including a dataset and dataloader and actually making use of them and the model in the train function.

I’ve tried the callback functionality with train.report() in case prints from the workers were being suppressed, but that didn’t change anything. CPU and GPU (if included) are also mostly idle while the script is stuck, so I’m positive that nothing is happening in the background and the script is indeed stuck.

The script is so barebones that I have no idea what the problem could be, especially because the behavior is identical on my laptop and on Google Colab. It’s probably user error, but I’m out of ideas and have read the Trainer docs and guides 20 times by now.

Windows 10
torch 1.10.2+cu113
torchaudio 0.10.2+cu113
torchvision 0.11.3+cu113
ray 1.10.0
Python 3.8.5 (conda installation)

Google Colab
torch 1.10.0+cu111
torchaudio 0.10.0+cu111
torchvision 0.11.1+cu111
ray 1.10.0

Ah @tikai103 thanks for pointing this out!

You’re right, this is a bug in Ray Train when using fractional GPUs, specifically in the get_device utility method (ray/torch.py at master · ray-project/ray · GitHub). We will fix this shortly, but in the meantime, can you wrap your model with DistributedDataParallel and move it to the device directly instead of using prepare_model?

Thank you for the quick response!

I will try this and come back to this thread if I run into any problems. That might also be helpful for other people with this use case.

How will I know when the bug has been fixed? When a new Ray version is released?

Trying to manually wrap the model in DDP produces similar freezing behavior. I changed the first 2 lines in “train_distributed” to

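# Imports used by this snippet (they can also live at module level).
import torch
import torch.distributed as dist
from ray import train
from torch.nn.parallel import DistributedDataParallel

device = torch.device(f"cuda:{train.local_rank()}" if torch.cuda.is_available() else "cpu")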

model = Mlp(10, 20, 10)
model = model.to(device)
print("Moved model to device:", device)
print(dist.is_initialized(), torch.cuda.is_available())
DDP_model = DistributedDataParallel(model, device_ids=[train.local_rank()] if torch.cuda.is_available() else None)
print("Wrapped model in DDP")

and get the following output (the final “Wrapped model in DDP” message never appears):

2022-03-07 10:18:46,077	INFO trainer.py:190 -- Trainer logs will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-18-46
 pid=14924) 2022-03-07 10:18:49,857	INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=2]
 pid=19624) 2022-03-07 10:18:49,873	INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=2]
2022-03-07 10:18:49,935	INFO trainer.py:196 -- Run results will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-18-46\run_001
 pid=14924) Moved model to device: cuda:0
 pid=14924) True True

When trying to manually initialize the default process group, it gets stuck on the last line:

print(dist.is_initialized(), train.local_rank(), train.world_size())
dist.init_process_group("gloo", rank=train.local_rank(), world_size=train.world_size())

with the output:

2022-03-07 10:12:57,949	INFO trainer.py:190 -- Trainer logs will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-12-57
 pid=19732) 2022-03-07 10:13:01,793	INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=2]
 pid=26516) 2022-03-07 10:13:01,793	INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=2]
2022-03-07 10:13:02,887	INFO trainer.py:196 -- Run results will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-12-57\run_001
 pid=19732) False 0 2

In both cases, the script is stuck after this print. Something in the setup with Trainer seems to cause deadlocks for DDP, I guess because of the fractional GPUs, but I don’t know. Also, the freezing behavior is still consistent between my Windows laptop with the gloo backend and Google Colab with the nccl backend.

The DDP docs of “init_process_group” say: “If using multiple processes per machine with nccl backend, each process must have exclusive access to every GPU it uses, as sharing GPUs between processes can result in deadlocks.”
Any advice on how to get this to work somehow? Or will I simply have to wait for the fix with a new ray version?

I had the same issue. Did you solve this?

No, since this is a fundamental issue with ray, I couldn’t do anything about it. I parallelized in different ways and just live without this kind of multiprocessing.

The fix has been merged into master ([Train] Fix `train.torch.get_device()` for fractional GPU or multiple GPU per worker case by amogkam · Pull Request #23763 · ray-project/ray · GitHub) and will be included in the 1.13 release!
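
For readers who cannot upgrade yet, the idea of choosing a device from the GPU that was actually assigned to a worker, rather than from its rank, can be sketched roughly as follows. This is not the code from the pull request: resolve_cuda_index is a hypothetical helper, and it assumes the worker knows its assigned GPU IDs and the value of CUDA_VISIBLE_DEVICES.

```python
from typing import Optional, Sequence, Union


def resolve_cuda_index(
    assigned_gpu_ids: Sequence[Union[int, str]],
    cuda_visible_devices: Optional[str],
) -> Optional[int]:
    """Map a worker's first assigned GPU ID to a CUDA device index.

    Hypothetical helper for illustration; not the code from the PR.
    Returns None when no GPU is assigned (use the CPU in that case).
    """
    if not assigned_gpu_ids:
        return None
    gpu_id = str(assigned_gpu_ids[0])
    if not cuda_visible_devices:
        # Nothing is masked, so the physical ID doubles as the device index.
        return int(gpu_id)
    visible = [d.strip() for d in cuda_visible_devices.split(",") if d.strip()]
    # Raises ValueError if the assigned GPU is not among the visible ones.
    return visible.index(gpu_id)


# Two workers that share physical GPU 0 both resolve to index 0.
print(resolve_cuda_index([0], "0"))      # 0
print(resolve_cuda_index(["2"], "1,2"))  # 1
```

Inside a Ray worker, the inputs could plausibly come from ray.get_gpu_ids() and os.environ.get("CUDA_VISIBLE_DEVICES"), and the resulting index would be used for torch.device(f"cuda:{index}") and for the device_ids argument of DistributedDataParallel. Note that this only addresses device selection; the nccl restriction on sharing a GPU between processes quoted earlier in the thread still applies.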

@tikai103 when wrapping with DistributedDataParallel manually with fractional GPUs, using train.local_rank() will not work (this was the original bug to begin with), as there are more workers than GPUs.
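
To make the point about local ranks concrete, here is a small pure-Python sketch (no Ray or PyTorch needed). The helper name out_of_range_ranks is hypothetical, and it assumes all workers run on a single node where each worker would pick cuda:{local_rank}.

```python
from typing import List


def out_of_range_ranks(num_workers: int, num_gpus: int) -> List[int]:
    """Local ranks whose cuda:{local_rank} index does not exist on this node."""
    # Valid CUDA indices are 0 .. num_gpus - 1; local ranks are 0 .. num_workers - 1.
    return [rank for rank in range(num_workers) if rank >= num_gpus]


# Setup from this thread: 2 workers at 0.4 GPU each, scheduled onto 1 physical GPU.
print(out_of_range_ranks(num_workers=2, num_gpus=1))  # [1]
```

With one GPU and two workers, rank 1 would ask for cuda:1, which does not exist, so the rank-based device choice only works when there is at least one whole GPU per worker.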