Ray Train silent for 7 min

Hi,

I have a PyTorch script that trains on 1 GPU in a couple of minutes.
I ported it to Ray Train to train across 4 GPUs. It printed the logs below and has since been silent for 7 minutes, with the GPUs idle. Is there a verbose mode that would help me understand why things are slow or not happening at all?

(BaseWorkerMixin pid=6741) 2022-01-07 12:27:20,892	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=6729) 2022-01-07 12:27:20,892	INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=6761) 2022-01-07 12:27:20,893	INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=6749) 2022-01-07 12:27:20,892	INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
2022-01-07 12:27:21,361	INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_12-27-16/run_001
(BaseWorkerMixin pid=6741) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6729) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6761) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6749) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=6761) 2022-01-07 12:27:24,103	INFO torch.py:239 -- Moving model to device: cuda:3
(BaseWorkerMixin pid=6761) 2022-01-07 12:30:57,274	INFO torch.py:242 -- Wrapping provided model in DDP.

(It’s now 12:38 and no further logs have appeared; the GPUs have never been active the whole time. I’m also curious why I see “Moving model to device: cuda:3” when I need the model to go to all 4 devices.)
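For reference, the script follows the standard Ray Train pattern. A simplified sketch of the structure (not my exact code; “resnet18” is just a placeholder for the torch.hub download you can see in the logs):

```python
import torch
from ray.train import Trainer
from ray.train.torch import prepare_model


def train_func():
    # The model comes from torch.hub, which is what triggers the
    # "Downloading: .../vision/archive/v0.9.1.zip" lines in the logs.
    # "resnet18" is a placeholder; the exact model is not important here.
    model = torch.hub.load("pytorch/vision:v0.9.1", "resnet18", pretrained=True)

    # prepare_model moves the model to this worker's GPU and wraps it in DDP
    # (the "Moving model to device" / "Wrapping provided model in DDP" logs).
    model = prepare_model(model)

    # ... build the dataloader, optimizer, and the usual training loop ...


trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```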

Yeah, this does look like something is hanging - one of the workers is executing the model preparation logic while the others are not. My intuition is that there's some sort of deadlock or contention going on here, but it's not immediately clear to me where.

Could you share a reproducible script?

Another way you can debug is to run py-spy against the worker PIDs to see exactly where each process is stuck (e.g. for (BaseWorkerMixin pid=6741) you can run py-spy dump --pid 6741). This should at least provide some hint as to what's causing this.
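For example, something like this (just a convenience wrapper, assuming py-spy is installed via pip install py-spy; the PIDs are taken from the log prefixes above, and attaching may require sudo depending on your ptrace settings) will dump all four workers so you can compare where each rank is blocked:

```python
import subprocess

# Worker PIDs from the (BaseWorkerMixin pid=...) log prefixes above.
for pid in (6741, 6729, 6761, 6749):
    print(f"=== py-spy dump for worker {pid} ===")
    subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)
```

If three ranks are sitting inside a collective call (e.g. waiting on the process group) while one is still somewhere else, that usually points straight at the source of the hang.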