Ray Trainer prepare_model gets stuck

tikai103 · March 4, 2022, 4:42pm

I am using Ray Trainer in a typical distributed training setup. My problem is that my code hangs on the line “student = prepare_model(student)”, right after the log message “Wrapping provided model in DDP.”. I’ve let the script run for several minutes without any visible progress.

Below is a minimal example, without even a DataLoader. It runs on Windows, hence the gloo backend; the same happens on Google Colab with the regular nccl backend, i.e. Trainer(backend=“torch”). For now, my system has only 1 GPU, but I’m not aware that this would be a problem for Ray, since it is not one for regular DDP.

Minimal working example:

import ray
import ray.train as rt
import torch.nn as nn
from ray.train.torch import TorchConfig, prepare_model
from ray.train import Trainer


def train_distributed():
    student = Mlp(10, 20, 10)
    student = prepare_model(student)

    for epoch in range(2):
        for i in range(10):
            rt.report()


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        return x


if __name__ == '__main__':
    ray.init()
    trainer = Trainer(backend=TorchConfig(backend="gloo"), num_workers=2, use_gpu=True,
                      resources_per_worker={"GPU": 0.4})
    trainer.start()
    trainer.run(train_distributed)
    trainer.shutdown()

Print:

2022-03-04 17:17:05,219|INFO trainer.py:190 -- Trainer logs will be logged in: C:\Users\kaise\ray_results\train_2022-03-04_17-17-05
 pid=25556) 2022-03-04 17:17:07,158 INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=2]
 pid=22548) 2022-03-04 17:17:07,148 INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=2]
2022-03-04 17:17:08,234|INFO trainer.py:196 -- Run results will be logged in: C:\Users\kaise\ray_results\train_2022-03-04_17-17-05\run_001
 pid=25556) 2022-03-04 17:17:09,484 INFO torch.py:239 -- Moving model to device: cuda:0
 pid=25556) 2022-03-04 17:17:09,594 INFO torch.py:242 -- Wrapping provided model in DDP.

The same also happens when including a dataset and dataloader and actually making use of them and the model in the train function.

I’ve tried the Callback functionality with train.report(), in case prints from the workers were being suppressed, but it didn’t change anything. Also, the CPU and GPU (if included) are mostly idle while the script is stuck, so I’m positive that nothing is happening in the background and the script is indeed stuck.

The script is so barebones that I have no idea what the problem could be, especially because the behavior is identical on my laptop and on Google Colab. It’s probably user error, but I’m out of ideas and have read the Trainer docs and guides 20 times by now.

System:
Windows 10
torch 1.10.2+cu113
torchaudio 0.10.2+cu113
torchvision 0.11.3+cu113
ray 1.10.0
Python 3.8.5 (conda installation)

Google Colab
torch 1.10.0+cu111
torchaudio 0.10.0+cu111
torchvision 0.11.1+cu111
ray 1.10.0

amogkam · March 5, 2022, 12:10am

Ah @tikai103 thanks for pointing this out!

You’re right, this is a bug in Ray Train when using fractional GPUs, specifically in the get_device utility method (ray/torch.py at master · ray-project/ray · GitHub). We will fix this in short order, but in the meantime, can you wrap your model with DistributedDataParallel and move it to the device directly instead of using prepare_model?

tikai103 · March 5, 2022, 8:11am

Thank you for the quick response!

I will try this and come back to this thread if I run into any problems. That might also be helpful for other people with this use case.

How will I know when the bug is fixed? When a new Ray version is released?

tikai103 · March 7, 2022, 9:26am

Trying to manually wrap the model in DDP produces similar freezing behavior. I changed the first 2 lines in “train_distributed” to

device = torch.device(f"cuda:{train.local_rank()}" if torch.cuda.is_available() else "cpu")
torch.cuda.set_device(device)

model = Mlp(10, 20, 10)
model = model.to(device)
print("Moved model to device:", device)
print(dist.is_initialized(), torch.cuda.is_available())
DDP_model = DistributedDataParallel(model, device_ids=[train.local_rank()] if torch.cuda.is_available() else None)
print("Wrapped model in DDP")

and get the expected print:

2022-03-07 10:18:46,077	INFO trainer.py:190 -- Trainer logs will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-18-46
 pid=14924) 2022-03-07 10:18:49,857	INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=2]
 pid=19624) 2022-03-07 10:18:49,873	INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=2]
2022-03-07 10:18:49,935	INFO trainer.py:196 -- Run results will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-18-46\run_001
 pid=14924) Moved model to device: cuda:0
 pid=14924) True True

When trying to manually initialize the default process group, it gets stuck on the last line:

dist.destroy_process_group()
print(dist.is_initialized(), train.local_rank(), train.world_size())
dist.init_process_group("gloo", rank=train.local_rank(), world_size=train.world_size())

with print:

2022-03-07 10:12:57,949	INFO trainer.py:190 -- Trainer logs will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-12-57
 pid=19732) 2022-03-07 10:13:01,793	INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=2]
 pid=26516) 2022-03-07 10:13:01,793	INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=2]
2022-03-07 10:13:02,887	INFO trainer.py:196 -- Run results will be logged in: C:\Users\kaise\ray_results\train_2022-03-07_10-12-57\run_001
 pid=19732) False 0 2

In both cases, the script is stuck after this print. Something in the Trainer setup seems to cause deadlocks for DDP, I guess because of the fractional GPUs, but I don’t know. The freezing behavior is still consistent between my Windows laptop with the gloo backend and Google Colab with the nccl backend.

The DDP docs for “init_process_group” say: "If using multiple processes per machine with nccl backend, each process must have exclusive access to every GPU it uses, as sharing GPUs between processes can result in deadlocks."
Any advice on how to get this to work? Or will I simply have to wait for the fix in a new Ray version?
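
Editor's note: the condition described in the quoted docs (several processes sharing one GPU) can be checked with a small helper. This is only a sketch. The function name find_shared_gpus and its input format are hypothetical, and it assumes the GPU IDs assigned to each worker have already been collected, for example from ray.get_gpu_ids() inside each worker.

```python
from collections import Counter
from typing import Dict, List


def find_shared_gpus(gpu_ids_per_worker: List[List[int]]) -> Dict[int, int]:
    """Return {gpu_id: worker_count} for every GPU used by more than one worker.

    gpu_ids_per_worker is a hypothetical input: one list of assigned GPU IDs
    per worker process, gathered by the caller.
    """
    # Count each GPU at most once per worker, then keep those seen more than once.
    counts = Counter(gpu for ids in gpu_ids_per_worker for gpu in set(ids))
    return {gpu: n for gpu, n in counts.items() if n > 1}


# Two workers with 0.4 GPU each on a single-GPU machine both land on GPU 0.
print(find_shared_gpus([[0], [0]]))  # {0: 2}
```

An empty result means every worker has exclusive GPUs. A non-empty result matches the fractional-GPU setup in this thread.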

myoh1 · May 24, 2022, 2:47pm

I had the same issue. Did you solve this?

tikai103 · May 24, 2022, 3:20pm

No, since this is a fundamental issue with ray, I couldn’t do anything about it. I parallelized in different ways and just live without this kind of multiprocessing.

amogkam · June 6, 2022, 10:06pm

The fix has been merged into master ([Train] Fix `train.torch.get_device()` for fractional GPU or multiple GPU per worker case by amogkam · Pull Request #23763 · ray-project/ray · GitHub) and will be included in the 1.13 release!

@tikai103 when wrapping with DistributedDataParallel manually with fractional GPUs, using train.local_rank() will not work (this was the original bug to begin with), as there are more workers than GPUs.
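
Editor's note: a minimal sketch of choosing the device index from the GPU assigned to the worker rather than from train.local_rank(). It is not the merged fix verbatim. The helper name resolve_device_index is hypothetical, and the sketch assumes the assigned GPU IDs come from ray.get_gpu_ids() and that CUDA_VISIBLE_DEVICES, if set, lists the devices the process can see.

```python
from typing import List, Optional


def resolve_device_index(assigned_gpu_ids: List[int],
                         cuda_visible_devices: Optional[str]) -> int:
    """Map the worker's first assigned GPU ID to a CUDA device index.

    assigned_gpu_ids: GPU IDs assigned to this worker (assumed to come
        from ray.get_gpu_ids()).
    cuda_visible_devices: value of the CUDA_VISIBLE_DEVICES environment
        variable, or None if it is unset.
    """
    if not assigned_gpu_ids:
        # No GPU information available: fall back to the first visible device.
        return 0
    gpu_id = int(assigned_gpu_ids[0])
    if cuda_visible_devices:
        visible = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
        if gpu_id in visible:
            # CUDA indexes devices by their position in the visible list.
            return visible.index(gpu_id)
    return gpu_id


# Both fractional-GPU workers on a single-GPU machine resolve to index 0,
# whereas train.local_rank() would give 0 and 1.
print(resolve_device_index([0], "0"))  # 0
```

Inside the training function, the result could be used as torch.device(f"cuda:{index}") and passed as device_ids=[index] to DistributedDataParallel. Whether several workers can then share that GPU still depends on the backend, as the docs quoted earlier in the thread warn for nccl.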
