### ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect)

After setting up a Ray cluster with 2 nodes (a single GPU on each), and also trying a direct PyTorch distributed run on the same nodes, my distributed processes registered successfully, starting 2 processes with the NCCL backend.
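Roughly, the setup looks like this (a minimal sketch assuming ray_lightning's `RayStrategy` API; the actual training script is larger and `model` stands in for my LightningModule):

```python
import ray
import pytorch_lightning as pl
from ray_lightning import RayStrategy

ray.init(address="auto")  # attach to the existing 2-node cluster

strategy = RayStrategy(
    num_workers=2,   # one worker per node
    use_gpu=True,    # single GPU per node; NCCL is used under the hood
)

trainer = pl.Trainer(strategy=strategy, max_epochs=1)
trainer.fit(model)  # `model` is my LightningModule (omitted here)
```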

The NCCL INFO output:

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=423719, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760) distributed_backend=nccl
(RayExecutor pid=508760) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760)
(RayExecutor pid=508760) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=508760) NCCL version 2.14.3+cuda11.7
```

But as soon as this message appears, I get an ncclInternalError: Internal check failed:

```
RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=508760, ip=172.16.96.59,
repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7fa16a4327d0>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)
```
```
(ray) windows@hostssh:~/Video-Detection$ nvidia-smi
Tue Mar 14 20:40:29 2023
```
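To isolate whether this is a Ray/Lightning problem or NCCL itself, the same collective that fails in the traceback can be run bare, outside Ray, once on each node (a sketch; the head-node address, port, and the `RANK` environment variable are placeholders for my cluster):

```python
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])  # set manually: 0 on the head node, 1 on the worker
world_size = 2

dist.init_process_group(
    backend="nccl",
    init_method="tcp://172.16.96.59:29500",  # head node address, example port
    rank=rank,
    world_size=world_size,
)
torch.cuda.set_device(0)  # single GPU per node

# Mirror pytorch_lightning's broadcast of the log dir (the failing call above).
objs = ["log_dir"] if rank == 0 else [None]
dist.broadcast_object_list(objs, src=0)
print(f"rank {rank} received: {objs[0]}")

dist.destroy_process_group()
```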

I am running this on an on-premise cluster without any containerization. The single-GPU code works successfully (with batch size 16), so I need to do model parallelism.
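One thing that may or may not matter: rank 1 registers from 172.16.0.2, while rank 0 bootstraps on enp3s0 at 172.16.96.59, i.e. the nodes are on different subnets. In case NCCL is picking the wrong interface, pinning it before process-group init is something I can try (a sketch; the interface name is taken from the log above and would differ per node):

```python
import os

os.environ["NCCL_DEBUG"] = "INFO"            # more verbose NCCL logging
os.environ["NCCL_SOCKET_IFNAME"] = "enp3s0"  # force the interface NCCL bootstraps on
os.environ["NCCL_IB_DISABLE"] = "1"          # assuming no InfiniBand on this cluster
```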