If I do this: scaling_config = ScalingConfig(num_workers=5, use_gpu=True, resources_per_worker={"GPU": 0.2})
I get this error:
> torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
> ncclInternalError: Internal check failed.
> Last error:
> Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 65000
@kai
I want to try a federated learning simulation on a single GPU by running the client and server workflows in parallel across workers and letting them exchange model parameters.
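To make the goal concrete, here is a minimal sketch of the round structure such a simulation would need, independent of Ray or torch. All names (`local_update`, `fed_avg`, `simulate`) are hypothetical, and the "training" is a fake gradient step; this only illustrates the client/server parameter exchange, not the actual Ray Train setup:

```python
# Hypothetical FedAvg-style simulation on plain Python lists.
# Each "client" starts from the global model, does one fake local update,
# and the "server" averages the client models into the next global model.

def local_update(params, grad, lr=0.1):
    # Simulated local training step for one client (grad is made up).
    return [p - lr * g for p, g in zip(params, grad)]

def fed_avg(client_params):
    # Server step: element-wise average of all client parameter vectors.
    n = len(client_params)
    return [sum(vals) / n for vals in zip(*client_params)]

def simulate(num_clients=5, rounds=3):
    global_params = [1.0, -2.0, 0.5]
    for _ in range(rounds):
        # Each client trains locally from the current global model.
        updates = [
            local_update(global_params, grad=[0.2 * (i + 1)] * len(global_params))
            for i in range(num_clients)
        ]
        # Server aggregates client models into the new global model.
        global_params = fed_avg(updates)
    return global_params

print(simulate())
```

In a real Ray setup the clients could be separate actors sharing the one GPU via fractional `resources_per_worker`, with parameters exchanged through the object store instead of NCCL collectives.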