Ray Train: parallelize on a single GPU

Is it possible to parallelize on one GPU with multiple workers with PyTorch?
Something similar to this (but this does not work):

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

scaling_config = ScalingConfig(num_workers=5, use_gpu=True)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)

If I do this:

scaling_config = ScalingConfig(num_workers=5, use_gpu=True, resources_per_worker={"GPU": 0.2})

I get this error:

torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 65000

NCCL does not support multiple workers (ranks) sharing a single GPU. You could try using gloo instead:

from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    # ...,
    torch_config=TorchConfig(backend="gloo"),
)
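For example, a minimal end-to-end sketch (assuming a single-GPU machine; the training loop here is just a placeholder) that combines the gloo backend with fractional GPU scheduling could look like this:

from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

def train_loop_per_worker(config):
    # Placeholder training loop; each of the 5 workers is scheduled
    # with a 0.2 share of the single physical GPU.
    ...

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=5,
        use_gpu=True,
        resources_per_worker={"GPU": 0.2},
    ),
    torch_config=TorchConfig(backend="gloo"),
)
result = trainer.fit()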

That said, what are you trying to do here? Throughput should be higher if you just use the full GPU.

@kai
I want to simulate federated learning on a single GPU by parallelizing the client and server workflows across workers and allowing them to exchange model parameters.
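Roughly what I have in mind is something like this (a simplified sketch, not my actual code; the model and round count are placeholders). This function would be passed as train_loop_per_worker to the gloo-backed trainer from above, and each round it averages the parameters across all workers:

import torch
import torch.distributed as dist
from ray import train

def train_loop_per_worker(config):
    # Placeholder "client" model; in the real simulation each worker
    # would train on its own local data shard.
    model = torch.nn.Linear(10, 1)

    for round_idx in range(3):
        # ... local client training for this round would go here ...

        # FedAvg-style exchange: average every parameter across all
        # workers via the gloo process group that Ray Train sets up.
        world_size = dist.get_world_size()
        for param in model.parameters():
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data /= world_size

        train.report({"round": round_idx})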

What does your training loop look like?