RuntimeError: Unable to meet other processes at the rendezvous store. If you are using P2P communication, please check if tensors are put in the correct GPU

2025-02-21 02:52:05,069 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.do_allreduce() (pid=180023, ip=172.20.20.220, repr=<geesibling.adapters.jax.pipeline.devicecontext.MeshHostWorker object at 0x7fb14c7bfee0>)
  File"/root/jinsc/geesibling_PPDP/python/geesibling/adapters/jax/pipeline/devicecontext.py", line 517, in do_allreduce
    col.allreduce_multigpu([concatenated_allreduce_buffer], group_name=group_name)
  File "/root/miniconda3/envs/framework-jinsc/lib/python3.9/site-packages/ray/util/collective/collective.py", line 295, in allreduce_multigpu
    g.allreduce(tensor_list, opts)
  File "/root/miniconda3/envs/framework-jinsc/lib/python3.9/site-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 197, in allreduce
    self._collective(tensors, tensors, collective_fn)
  File "/root/miniconda3/envs/framework-jinsc/lib/python3.9/site-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 604, in _collective
    comms = self._get_nccl_collective_communicator(key, devices)
  File "/root/miniconda3/envs/framework-jinsc/lib/python3.9/site-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 431, in _get_nccl_collective_communicator
    rendezvous.meet()
  File "/root/miniconda3/envs/framework-jinsc/lib/python3.9/site-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 89, in meet
    raise RuntimeError(
RuntimeError: Unable to meet other processes at the rendezvous store. If you are using P2P communication, please check if tensors are put in the correct GPU.

When I call col.allreduce_multigpu, I get the error above; the same thing happens with col.recv_multigpu.
Several of my communication operations are issued from functions like this one, for example:

import cupy
import ray.util.collective as col

def do_allreduce(self, var1):
    ...
    with cupy.cuda.Device(0):
        # Copy the input onto GPU 0; allreduce_multigpu reduces in place.
        var2 = cupy.array(var1)
        col.allreduce_multigpu([var2], group_name=group_name)
        cupy.cuda.Device(0).synchronize()
        # Copy the reduced result back to the host.
        var3 = var2.get()
    ...
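
For context, each worker joins the collective group before any communication is issued, roughly like this (a simplified sketch rather than my exact code; the setup_group method name, world size, and ranks are placeholders):

import ray
import ray.util.collective as col

@ray.remote(num_gpus=1)
class MeshHostWorker:
    def setup_group(self, world_size, rank, group_name):
        # Illustrative only: every worker that takes part in the collective
        # calls this with the same world_size and group_name and a unique
        # rank. The NCCL rendezvous from the traceback happens lazily on
        # the first collective call against this group.
        col.init_collective_group(world_size, rank,
                                  backend="nccl",
                                  group_name=group_name)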

Why does this error occur? My Ray version is 2.1.0 and my NCCL version is 2.16.2.