Ray Train doesn't detect GPU


I’m using Ray Train to train a PyTorch model on an EC2 g4dn.12xlarge (4 × NVIDIA T4).

In addition to the model prep and dataloader prep functions, I’m creating the trainer like this:

    trainer = Trainer(backend="torch", num_workers=4)

However, it seems that Ray doesn’t detect the GPUs:

(BaseWorkerMixin pid=3904) 2022-01-07 11:14:13,538	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=3920) 2022-01-07 11:14:13,537	INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=3917) 2022-01-07 11:14:13,535	INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:13,537	INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
2022-01-07 11:14:14,571	INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_11-14-09/run_001
(BaseWorkerMixin pid=3904) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3920) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3917) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3929) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3929) /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
(BaseWorkerMixin pid=3929)   warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,108	INFO torch.py:239 -- Moving model to device: cpu
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,111	INFO torch.py:242 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=3904) /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
(BaseWorkerMixin pid=3904)   warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
(BaseWorkerMixin pid=3904) 2022-01-07 11:14:17,217	INFO torch.py:239 -- Moving model to device: cpu

In the execution logs I see “Moving model to device: cpu” and the warning “torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.”
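For longer runs, the same check can be done programmatically instead of by eye. The following is a minimal sketch, using only the Python standard library, that scans worker output for the “Moving model to device” lines shown above and lists the workers that ended up on the CPU. The function names and the regular expression are illustrative assumptions based on the log format in this post, not part of Ray.

```python
import re

# Matches lines like:
# (BaseWorkerMixin pid=3929) ... INFO torch.py:239 -- Moving model to device: cpu
# The pattern is an assumption derived from the log excerpt in this thread.
DEVICE_LINE = re.compile(
    r"\(BaseWorkerMixin pid=(\d+)\).*Moving model to device: (\S+)"
)


def devices_by_worker(log_text):
    """Map each worker pid to the device its model was moved to."""
    devices = {}
    for line in log_text.splitlines():
        match = DEVICE_LINE.search(line)
        if match:
            devices[int(match.group(1))] = match.group(2)
    return devices


def workers_on_cpu(log_text):
    """Return the sorted pids of workers whose model was not placed on a CUDA device."""
    return sorted(
        pid
        for pid, device in devices_by_worker(log_text).items()
        if not device.startswith("cuda")
    )


sample_log = (
    "(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,108 INFO torch.py:239 -- Moving model to device: cpu\n"
    "(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,111 INFO torch.py:242 -- Wrapping provided model in DDP.\n"
    "(BaseWorkerMixin pid=3904) 2022-01-07 11:14:17,217 INFO torch.py:239 -- Moving model to device: cpu\n"
)

print(devices_by_worker(sample_log))
print(workers_on_cpu(sample_log))
```

A non-empty result from `workers_on_cpu` on a GPU machine means the workers were not given GPUs.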

Is there something specific to do for Ray Train to be aware of GPUs?

A CUDA check in the script returns True, though. It’s only from within the Ray workers that the GPUs are not found:

    print(f'GPU available {torch.cuda.is_available()}')
    trainer = Trainer(backend="torch", num_workers=4)

OK, I just saw in the Trainer docs that use_gpu defaults to False. I set it explicitly:

    trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)

and now Ray Train correctly uses the GPUs:

INFO torch.py:239 -- Moving model to device: cuda:3
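To avoid hard-coding the flag, the Trainer arguments can be derived from the number of GPUs visible to the driver script. This is a rough sketch under several assumptions: a single-node setup like the one above, the Ray 1.x `ray.train.Trainer` API used in this thread, and one worker per GPU. `trainer_kwargs` and `make_trainer` are hypothetical helper names, not Ray functions.

```python
def trainer_kwargs(gpu_count):
    """Build Trainer keyword arguments from the number of visible GPUs.

    Assumes one worker per GPU, or a single CPU worker when no GPU is found.
    """
    use_gpu = gpu_count > 0
    return {
        "backend": "torch",
        "num_workers": gpu_count if use_gpu else 1,
        "use_gpu": use_gpu,  # defaults to False in Trainer, so set it explicitly
    }


def make_trainer():
    """Create a Trainer sized to the local GPUs (requires torch and Ray 1.x)."""
    # Imported lazily so the helper above can be used without Ray installed.
    import torch
    from ray.train import Trainer

    return Trainer(**trainer_kwargs(torch.cuda.device_count()))


print(trainer_kwargs(4))
```

On the 4-GPU instance from this thread, `trainer_kwargs(4)` gives the same arguments as the explicit call above.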

Maybe worth adding here that use_gpu should be set to True for GPU training.

Oh wow, I’m a little surprised that this isn’t explicitly mentioned in the user guide! Thanks for sharing your entire thought process here, I’ll make an update to the docs.

That’s right @Lacruche, you need to set use_gpu to True to enable GPU training. Good point about the docs, too.

@matthewdeng made a PR here [Train] Improve usability for GPU Training by amogkam · Pull Request #21464 · ray-project/ray · GitHub
