Hi,
I have a PyTorch script that trains on 1 GPU in a couple of minutes. I ported it to Ray Train to train over 4 GPUs; it printed the logs below and has since been silent for 7 minutes, with the GPUs idle. Is there a verbose mode that would let me understand why things are slow or not happening?
(BaseWorkerMixin pid=6741) 2022-01-07 12:27:20,892 INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=6729) 2022-01-07 12:27:20,892 INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=6761) 2022-01-07 12:27:20,893 INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=6749) 2022-01-07 12:27:20,892 INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
2022-01-07 12:27:21,361 INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_12-27-16/run_001
(BaseWorkerMixin pid=6741) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6729) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6761) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6749) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6761) 2022-01-07 12:27:24,103 INFO torch.py:239 -- Moving model to device: cuda:3
(BaseWorkerMixin pid=6761) 2022-01-07 12:30:57,274 INFO torch.py:242 -- Wrapping provided model in DDP.
(It's now 12:38 and no further logs have appeared; the GPU has never been active the whole time. Even the step between "Moving model to device" at 12:27:24 and "Wrapping provided model in DDP" at 12:30:57 took about 3.5 minutes.) I'm also curious why the log says "Moving model to device: cuda:3" when I need the model to go to all 4 devices.
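For context, here is roughly how I'm launching the run, reconstructed from my script (simplified; the model name and training loop are placeholders, not my actual code):

```python
import torch
from ray import train
from ray.train import Trainer

def train_func(config):
    # The real script pulls a torchvision model through torch.hub, which is
    # what produces the "Downloading: .../vision/archive/v0.9.1.zip" lines.
    # resnet18 is just a stand-in for my actual model.
    model = torch.hub.load("pytorch/vision:v0.9.1", "resnet18", pretrained=False)
    # prepare_model() is what emits "Moving model to device: cuda:N" and
    # "Wrapping provided model in DDP" on each worker.
    model = train.torch.prepare_model(model)
    # ... training loop elided ...

trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```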
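And in case anyone asks what I've already tried for verbosity: the knobs below are the ones I've found so far. I'm on a single node with the cluster auto-started by ray.init, so I'm assuming environment variables set before ray.init propagate to the worker processes; please correct me if there is a proper Ray Train debug switch I'm missing:

```python
import logging
import os

import ray

# Set before ray.init() so the spawned worker processes inherit them
# (single-node cluster started from this script).
os.environ["NCCL_DEBUG"] = "INFO"  # NCCL's own setup/collective debug output
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed logging; PyTorch 1.9+, I believe

# Raise the log level of Ray's Python-side loggers.
ray.init(logging_level=logging.DEBUG)
```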