Can't use GPUs on local cluster

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

We have installed Ray 2.7.1 on four local machines to form a cluster. One machine has 3 GPUs, one has 2, and two have 1, all NVIDIA. torch.cuda.is_available() returns True on every machine.
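For reference, something like this should confirm what Ray itself sees, in addition to the per-machine torch check (assuming it is run from a node that has already joined the cluster, so address="auto" resolves):

import ray
import torch

ray.init(address="auto")
# Total logical resources Ray sees across the whole cluster; expecting 7 GPUs (3 + 2 + 1 + 1)
print(ray.cluster_resources().get("GPU"))
# Local CUDA check on the machine this runs on
print(torch.cuda.is_available())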

I'm trying to run the "Train a PyTorch Model on Fashion MNIST" example from the Ray Train documentation to learn how to train a model with Ray Train.

The example runs fine with use_gpu=False, and Ray still reports that GPUs are available.
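For reference, this is essentially how the trainer is configured in that example; the training loop below is only a stand-in for the real one from the docs, and the scaling config is the part I'm changing:

import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # Stand-in for the Fashion MNIST training loop from the example
    model = ray.train.torch.prepare_model(nn.Linear(28 * 28, 10))
    # ... data loading, optimizer, and epoch loop as in the example ...

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    # use_gpu=False completes; use_gpu=True is where it hangs
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()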

When run as written with use_gpu=True, it starts up and then hangs at:

(TorchTrainer pid=683931, ip=10.114.0.224) Starting distributed worker processes: ['684131 (10.114.0.224)', '684132 (10.114.0.224)', '684133 (10.114.0.224)', '3475878 (10.114.0.45)']
(RayTrainWorker pid=684131, ip=10.114.0.224) Setting up process group for: env:// [rank=0, world_size=4]

and after 30 minutes:

ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_db6a2_00000

RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

Training errored after 0 iterations at 2023-10-31 00:16:47. Total running time: 30min 9s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-30_23-46-37/TorchTrainer_db6a2_00000_0_2023-10-30_23-46-37/error.txt

Any ideas what could be wrong?

Update: when I set num_workers=3 and use_gpu=True, it trains normally, with all worker processes placed on the 3-GPU machine.
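In other words, the only change from the snippet above is the scaling config:

scaling_config=ScalingConfig(num_workers=3, use_gpu=True)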

So, apparently, the problem is related to distributing the training across more than one machine?

Update 2:

When I take the 3-GPU machine out of the cluster, the job distributes across multiple machines just fine.

Continuing to troubleshoot.
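For what it's worth, a probe like this (the GPU count is hardcoded for this cluster, and gpu_probe is just an illustrative name) should show which machine each GPU task lands on and whether CUDA works there:

import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def gpu_probe():
    import socket
    import torch
    # Report the host this task landed on and its local CUDA state
    return socket.gethostname(), torch.cuda.is_available(), torch.cuda.device_count()

# One task per GPU in the cluster (3 + 2 + 1 + 1 = 7)
print(ray.get([gpu_probe.remote() for _ in range(7)]))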