Resource Allocation Issue When Using TorchTrainer with Tuner

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Problem Description
I want to use TorchTrainer as the trainable for Tuner to do hyperparameter tuning, but I've run into an issue.
In the TorchTrainer I want to set num_workers=2, i.e., use DDP for parallel training. I also want to run multiple trials concurrently, so I set 0.49 GPU per worker in the ScalingConfig. Assuming I have two GPUs, I expect to be able to run two trials simultaneously, with each trial using 0.49 of GPU 0 and 0.49 of GPU 1. A minimal sketch of the setup is shown below.
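Here is a rough reproduction of what I'm doing (the training loop and the `lr` search space are just placeholders, not my real code):

```python
import ray.train
from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import Tuner


def train_loop_per_worker(config):
    # Actual DDP training loop omitted. With the NCCL backend, both workers
    # of a trial end up on the same CUDA device and the duplicate-GPU error
    # is raised during process-group initialization.
    pass


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=2,                       # two DDP workers per trial
        use_gpu=True,
        resources_per_worker={"GPU": 0.49},  # fractional GPU so two trials fit on two GPUs
    ),
)

tuner = Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
    tune_config=tune.TuneConfig(num_samples=2),
)
results = tuner.fit()
```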
However, Ray seems to assign a single GPU to both workers of one trial, which causes an NCCL runtime error: duplicate GPU detected: rank 0 and rank 1 both on CUDA device 2000. Is there a way to resolve this? As an aside, I know that switching Torch's backend to Gloo allows multiple workers of the same trial to run on the same GPU, but I'd like to know whether this can be achieved with NCCL as well.