How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
We have installed Ray 2.7.1 on 4 local machines to make a cluster. One machine has 3 GPUs, one has 2, and two have 1, all NVIDIA. `torch.cuda.is_available()` shows True on all machines.
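For reference, a quick sanity check of the same thing through Ray itself would look roughly like this (a sketch, not part of the example; the probe count just matches the 7 GPUs in the cluster):

```python
import ray
import torch

ray.init(address="auto")  # connect to the existing cluster

# Cluster-wide view; should report GPU: 7.0 (3 + 2 + 1 + 1).
print(ray.cluster_resources())

@ray.remote(num_gpus=1)
def check_gpu():
    # Runs on whichever node the GPU slot was scheduled to.
    return (
        ray.util.get_node_ip_address(),
        torch.cuda.is_available(),
        torch.cuda.device_count(),
    )

# One probe per GPU in the cluster.
print(ray.get([check_gpu.remote() for _ in range(7)]))
```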
I’m trying to run the "Train a PyTorch Model on Fashion MNIST" example to learn how to train a model with Ray Train. The example runs fine with `use_gpu=False`, though it reports that GPUs are available. When run as written with `use_gpu=True`, it starts up and then hangs at:
```
(TorchTrainer pid=683931, ip=10.114.0.224) Starting distributed worker processes: ['684131 (10.114.0.224)', '684132 (10.114.0.224)', '684133 (10.114.0.224)', '3475878 (10.114.0.45)']
(RayTrainWorker pid=684131, ip=10.114.0.224) Setting up process group for: env:// [rank=0, world_size=4]
```
and after 30 minutes:
```
ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_db6a2_00000
…
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

Training errored after 0 iterations at 2023-10-31 00:16:47. Total running time: 30min 9s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-30_23-46-37/TorchTrainer_db6a2_00000_0_2023-10-30_23-46-37/error.txt
```
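For context, the trainer setup from the example, as I’m running it, is essentially this (paraphrased; `train_func` is the per-worker training loop from the example):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# 4 workers with one GPU each, matching world_size=4 in the log above.
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```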
Any ideas what could be wrong?
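If more detail would help, one way to get extra logging out of the distributed backend during startup would be to set the usual NCCL / torch.distributed debug variables on the workers, e.g. along these lines (a sketch; whether the runtime_env route is the right place for them here is an assumption on my part):

```python
import ray

# Propagate verbose NCCL / torch.distributed logging to every worker,
# so the hang in process-group setup produces more detail.
ray.init(
    address="auto",
    runtime_env={
        "env_vars": {
            "NCCL_DEBUG": "INFO",
            "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
        }
    },
)
```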