Ray Train silent for 7 min

Hi,

I have a PyTorch script that trains on 1 GPU in a couple of minutes.
I ported it to Ray Train to train across 4 GPUs. It printed the logs below and has since been silent for 7 minutes, with the GPUs idle. Is there a verbose mode that would help me understand why things are slow or not happening at all?

(BaseWorkerMixin pid=6741) 2022-01-07 12:27:20,892	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=6729) 2022-01-07 12:27:20,892	INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=6761) 2022-01-07 12:27:20,893	INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=6749) 2022-01-07 12:27:20,892	INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
2022-01-07 12:27:21,361	INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_12-27-16/run_001
(BaseWorkerMixin pid=6741) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6729) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6761) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6749) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6761) 2022-01-07 12:27:24,103	INFO torch.py:239 -- Moving model to device: cuda:3
(BaseWorkerMixin pid=6761) 2022-01-07 12:30:57,274	INFO torch.py:242 -- Wrapping provided model in DDP.

(It’s now 12:38 and no further logs have appeared; the GPUs have never been active the whole time. I’m also curious why I see “Moving model to device: cuda:3” when I need the model to go to all 4 devices.)
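For reference, the script follows the standard Ray Train pattern. A simplified sketch of the structure (not my exact code; “resnet18” is just a placeholder for the torch.hub download you can see in the logs):

```python
import torch
from ray.train import Trainer
from ray.train.torch import prepare_model


def train_func():
    # The model comes from torch.hub, which is what triggers the
    # "Downloading: .../vision/archive/v0.9.1.zip" lines in the logs.
    # "resnet18" is a placeholder; the exact model is not important here.
    model = torch.hub.load("pytorch/vision:v0.9.1", "resnet18", pretrained=True)

    # prepare_model moves the model to this worker's GPU and wraps it in DDP
    # (the "Moving model to device" / "Wrapping provided model in DDP" logs).
    model = prepare_model(model)

    # ... build the dataloader, optimizer, and the usual training loop ...


trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```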

Yeah, this does look like something is hanging - one of the workers is executing the model preparation logic while the others are not. My intuition is that there's some sort of deadlock or contention going on here, but it's not immediately clear to me where.

Could you share a reproducible script?

Another way you can debug is to run py-spy against the worker PIDs to see exactly where each process is stuck (e.g. for (BaseWorkerMixin pid=6741) you can run py-spy dump --pid 6741). This should at least provide some hint as to what's causing this.
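For example, something like this (just a convenience wrapper, assuming py-spy is installed via pip install py-spy; the PIDs are taken from the log prefixes above, and attaching may require sudo depending on your ptrace settings) will dump all four workers so you can compare where each rank is blocked:

```python
import subprocess

# Worker PIDs from the (BaseWorkerMixin pid=...) log prefixes above.
for pid in (6741, 6729, 6761, 6749):
    print(f"=== py-spy dump for worker {pid} ===")
    subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)
```

If three ranks are sitting inside a collective call (e.g. waiting on the process group) while one is still somewhere else, that usually points straight at the source of the hang.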