How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Problem Description
I want to use TorchTrainer as the trainable of a Tuner for hyperparameter tuning, but I've run into an issue.
In the TorchTrainer I want to set num_workers=2, meaning I want to use DDP for parallel training within each trial. I also want to run multiple trials concurrently, so I set 0.49 GPUs per worker in the ScalingConfig. Assuming I have two GPUs, I expect to be able to run two trials simultaneously, with each trial using 0.49 of GPU 0 and 0.49 of GPU 1.
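Roughly, this is the setup I'm describing. The training function and the lr search space below are just placeholders, not my actual workload; the relevant part is the ScalingConfig with num_workers=2 and 0.49 GPUs per worker:

```python
import torch
import torch.nn as nn
from ray import train, tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_func(config):
    # Placeholder training loop; prepare_model wraps the model in DDP,
    # which is where the NCCL process group gets initialized.
    model = prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-3))
    device = get_device()
    for _ in range(10):
        optimizer.zero_grad()
        loss = model(torch.randn(4, 10, device=device)).sum()
        loss.backward()
        optimizer.step()
    train.report({"loss": loss.item()})


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=2,                        # two DDP workers per trial
        use_gpu=True,
        resources_per_worker={"GPU": 0.49},   # fractional GPU so trials can share devices
    ),
)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-2)}},
    tune_config=tune.TuneConfig(num_samples=2),  # expecting two concurrent trials
)
results = tuner.fit()
```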
However, Ray seems to assign both workers of one trial to a single GPU, causing an NCCL runtime error: duplicate GPU detected: rank 0 and rank 1 both on CUDA device 2000. Is there any way to resolve this? BTW, I know that switching Torch's backend to Gloo allows multiple workers of the same trial to run on the same GPU, but I'd like to know whether this can be achieved with NCCL as well. A sketch of the Gloo workaround follows below.
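For reference, this is roughly how I switch to the Gloo backend (reusing the train_func from the sketch above); it avoids the NCCL error, but I would prefer to keep NCCL:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

# Same trainer as above, but forcing the Gloo backend; with this, the two
# workers of one trial can share a single GPU, at the cost of slower
# collective communication than NCCL.
gloo_trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"GPU": 0.49},
    ),
)
```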