Hi,
I’m using Ray Train to train a PyTorch model on an EC2 g4dn.12xlarge instance (4 × NVIDIA T4 GPUs).
In addition to the model and dataloader preparation functions, I’m running:
trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
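From my reading of the Trainer signature in the Ray Train docs for this release line, I suspect GPUs have to be requested explicitly per worker. This is just a sketch of what I think the call should look like (the `use_gpu` flag is my reading of the docs, not something I’ve confirmed fixes this):

```python
from ray.train import Trainer

# Sketch: use_gpu=True should make Ray Train reserve one GPU per worker
# and set CUDA_VISIBLE_DEVICES for each worker process accordingly.
trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
trainer.start()
trainer.run(train_func)  # train_func as defined above
trainer.shutdown()
```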
However, it seems that Ray doesn’t detect the GPUs:
(BaseWorkerMixin pid=3904) 2022-01-07 11:14:13,538 INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=3920) 2022-01-07 11:14:13,537 INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=3917) 2022-01-07 11:14:13,535 INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:13,537 INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
2022-01-07 11:14:14,571 INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_11-14-09/run_001
(BaseWorkerMixin pid=3904) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3920) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3917) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3929) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=3929) /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
(BaseWorkerMixin pid=3929) warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,108 INFO torch.py:239 -- Moving model to device: cpu
(BaseWorkerMixin pid=3929) 2022-01-07 11:14:17,111 INFO torch.py:242 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=3904) /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
(BaseWorkerMixin pid=3904) warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
(BaseWorkerMixin pid=3904) 2022-01-07 11:14:17,217 INFO torch.py:239 -- Moving model to device: cpu
In the execution logs I see “Moving model to device: cpu” and the warning “torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.”
Is there something specific I need to do for Ray Train to be aware of the GPUs?
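In case it helps diagnose: since (as I understand it) Ray exposes assigned GPUs to each worker through CUDA_VISIBLE_DEVICES, I can add a stdlib-only check at the top of train_func to see what each worker actually sees. The helper name is mine, just for illustration:

```python
import os

def check_gpu_visibility():
    # Ray sets CUDA_VISIBLE_DEVICES per worker when GPUs are assigned;
    # if it's unset or empty, PyTorch falls back to CPU (as in my logs).
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    print("CUDA_VISIBLE_DEVICES:", repr(visible))
    return visible != ""
```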