Can't use GPUs on local cluster

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

We have installed Ray 2.7.1 on four local machines to form a cluster. One machine has 3 GPUs, one has 2, and two have 1, all NVIDIA. torch.cuda.is_available() returns True on every machine.
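For reference, something like this should confirm what Ray itself sees, in addition to the per-machine torch check (assuming it is run from a node that has already joined the cluster, so address="auto" resolves):

import ray
import torch

ray.init(address="auto")
# Total logical resources Ray sees across the whole cluster; expecting 7 GPUs (3 + 2 + 1 + 1)
print(ray.cluster_resources().get("GPU"))
# Local CUDA check on the machine this runs on
print(torch.cuda.is_available())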

I'm trying to run the "Train a PyTorch Model on Fashion MNIST" example from the Ray Train documentation to learn how to train a model with Ray Train.

The example runs fine with use_gpu=False, and Ray still reports that GPUs are available.
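For reference, this is essentially how the trainer is configured in that example; the training loop below is only a stand-in for the real one from the docs, and the scaling config is the part I'm changing:

import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # Stand-in for the Fashion MNIST training loop from the example
    model = ray.train.torch.prepare_model(nn.Linear(28 * 28, 10))
    # ... data loading, optimizer, and epoch loop as in the example ...

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    # use_gpu=False completes; use_gpu=True is where it hangs
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()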

When run as written with use_gpu=True, it starts up and then hangs at:

(TorchTrainer pid=683931, ip=10.114.0.224) Starting distributed worker processes: ['684131 (10.114.0.224)', '684132 (10.114.0.224)', '684133 (10.114.0.224)', '3475878 (10.114.0.45)']
(RayTrainWorker pid=684131, ip=10.114.0.224) Setting up process group for: env:// [rank=0, world_size=4]

and after 30 minutes:

ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_db6a2_00000

RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

Training errored after 0 iterations at 2023-10-31 00:16:47. Total running time: 30min 9s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-30_23-46-37/TorchTrainer_db6a2_00000_0_2023-10-30_23-46-37/error.txt

Any ideas what could be wrong?

Update: when I set num_workers=3 and use_gpu=True, it trains normally, with all worker processes placed on the 3-GPU machine.
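In other words, the only change from the snippet above is the scaling config:

scaling_config=ScalingConfig(num_workers=3, use_gpu=True)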

So, apparently, the problem is related to distributing the training across more than one machine?

Update 2:

When I take the 3-GPU machine out of the cluster, the job distributes across multiple machines just fine.

Continuing to troubleshoot.
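For what it's worth, a probe like this (the GPU count is hardcoded for this cluster, and gpu_probe is just an illustrative name) should show which machine each GPU task lands on and whether CUDA works there:

import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def gpu_probe():
    import socket
    import torch
    # Report the host this task landed on and its local CUDA state
    return socket.gethostname(), torch.cuda.is_available(), torch.cuda.device_count()

# One task per GPU in the cluster (3 + 2 + 1 + 1 = 7)
print(ray.get([gpu_probe.remote() for _ in range(7)]))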