Ray Train doesn't detect GPU

Hi,

I’m using Ray Train to train a PyTorch model on an EC2 g4dn.12xlarge (4× NVIDIA T4 GPUs).

In addition to the model prep and dataloader prep functions, I’m adding:

    from ray.train import Trainer

    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    trainer.run(train_func)
    trainer.shutdown()
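
For context, my train_func follows the usual Ray Train torch pattern. Here is a simplified stand-in (placeholder model and data only; the real model is pulled from torch.hub, hence the vision v0.9.1 downloads in the logs below, and the dataloader comes from my own prep function):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from ray import train
    import ray.train.torch  # noqa: F401  (enables the train.torch.prepare_* helpers)

    def train_func():
        # Placeholder model/data just to show the structure.
        model = train.torch.prepare_model(nn.Linear(10, 1))  # move to device + wrap in DDP
        loader = train.torch.prepare_data_loader(
            DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
                       batch_size=8)
        )

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()
        for _ in range(2):
            for x, y in loader:
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()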

However, it seems that Ray doesn’t detect the GPUs:

(BaseWorkerMixin pid=3904) 2022-01-07 11:14:13,538	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=3920) 2022-01-07 11:14:13,537	INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=3917) 2022-01-07 11:14:13,535	INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:13,537	INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
2022-01-07 11:14:14,571	INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_11-14-09/run_001
(BaseWorkerMixin pid=3904) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3920) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3917) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3929) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3929) /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
(BaseWorkerMixin pid=3929)   warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,108	INFO torch.py:239 -- Moving model to device: cpu
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,111	INFO torch.py:242 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=3904) /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
(BaseWorkerMixin pid=3904)   warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
(BaseWorkerMixin pid=3904) 2022-01-07 11:14:17,217	INFO torch.py:239 -- Moving model to device: cpu

In the execution logs I see “Moving model to device: cpu” and the warning “torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.”

Is there something specific I need to do for Ray Train to be aware of the GPUs?

A CUDA check in the driver script returns True though… it’s really from within the Ray workers that the GPUs are not found:


    print(f'GPU available {torch.cuda.is_available()}')
    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    trainer.run(train_func)
    trainer.shutdown()
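
(A quick way to see this directly is to run a throwaway debug function through the Trainer so that each worker prints its own view of CUDA. This is just a sketch for debugging, not my real train_func:)

    import os
    import torch
    from ray.train import Trainer

    def debug_func():
        # Each worker reports what it can see; Ray restricts CUDA_VISIBLE_DEVICES
        # for workers that were not assigned any GPUs.
        print(
            f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')!r}, "
            f"cuda available in worker: {torch.cuda.is_available()}"
        )

    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    trainer.run(debug_func)
    trainer.shutdown()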

OK, I just saw in the Trainer docs that use_gpu defaults to False.
I specified it: trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
and now Ray Train correctly uses the GPUs:

INFO torch.py:239 -- Moving model to device: cuda:3
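
For completeness, this is the full driver snippet again, identical to the one above except for use_gpu=True:

    from ray.train import Trainer

    # With use_gpu=True, each of the 4 workers gets a GPU and the model is
    # moved to that worker's cuda device instead of cpu.
    trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
    trainer.start()
    trainer.run(train_func)
    trainer.shutdown()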

Maybe worth adding here that use_gpu should be set to True for GPU training.

Oh wow, I’m a little surprised that this isn’t explicitly mentioned in the user guide! Thanks for sharing your entire thought process here; I’ll make an update to the docs.

That’s right @Lacruche, you need to set use_gpu to True to enable GPU training. Good point about the docs, too.

@matthewdeng made a PR here: [Train] Improve usability for GPU Training by amogkam · Pull Request #21464 · ray-project/ray · GitHub
