How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi! I am trying to use the Ray Collective Communication Library for communication between distributed CPUs, with gloo as the backend. I get the following error when I run my script:
NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2022-10-02 13:40:43,958 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.29.58.27:6379...
2022-10-02 13:40:43,963 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(pid=8135) 2022-10-02 13:40:45,050 WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2022-10-02 13:40:45,117 WARNING worker.py:1829 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="a2d0ba68-b14a-4cc4-8fcc-f383b869e3ed", ...)
2022-10-02 13:40:45,124 WARNING worker.py:1829 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="a2d0ba68-b14a-4cc4-8fcc-f383b869e3ed", ...)
(pid=8137) 2022-10-02 13:40:45,119 WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
(pid=37650, ip=172.29.58.192) 2022-10-02 13:40:46,116 WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
Traceback (most recent call last):
File "demo_collective_communication_all_reduce.py", line 32, in <module>
_ = ray.get(init_rets)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2275, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::Worker.setup() (pid=8135, ip=172.29.58.27, repr=<demo_collective_communication_all_reduce.Worker object at 0x7f69ead45520>)
File "demo_collective_communication_all_reduce.py", line 14, in setup
collective.init_collective_group(world_size, rank, "gloo", "default")
File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective.py", line 148, in init_collective_group
_group_mgr.create_collective_group(backend, world_size, rank, group_name)
File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective.py", line 63, in create_collective_group
g = GLOOGroup(
File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 209, in __init__
self._rendezvous.meet()
File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 158, in meet
self._store.delKeys(keys)
AttributeError: 'RayInternalKvStore' object has no attribute 'del_keys'
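One way to narrow down an AttributeError like this is to list which similarly named methods the object actually exposes. The following is a minimal sketch only: `matching_attributes` and `FakeStore` are hypothetical names made up for illustration, and `FakeStore` is a stand-in, not Ray's real `RayInternalKvStore`. In practice the helper would be called on the real store object inside the worker.

```python
def matching_attributes(obj, fragment):
    """Return sorted public attribute names of obj that contain fragment.

    The match is case-insensitive, so "del" finds both del_keys and delKeys.
    """
    fragment = fragment.lower()
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and fragment in name.lower()
    )


class FakeStore:
    """Hypothetical stand-in for illustration; NOT Ray's RayInternalKvStore."""

    def delete_keys(self, keys):
        return len(keys)


if __name__ == "__main__":
    # Shows which "del"-like methods the object really has.
    print(matching_attributes(FakeStore(), "del"))
```

If the list does not contain the name reported in the error, the caller and the store class probably come from versions that do not match.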
This is the script I am running:
import numpy as np
import ray
import ray.util.collective as col
from ray.util.collective.types import Backend, ReduceOp
import torch


@ray.remote(num_cpus=4)
class Worker:
    def __init__(self):
        self.buffer = None
        self.list_buffer = None

    def init_tensors(self):
        self.buffer = np.ones((10,), dtype=np.float32)
        self.list_buffer = [np.ones((10,), dtype=np.float32) for _ in range(2)]
        return True

    def init_group(self, world_size, rank, backend=Backend.NCCL, group_name="default"):
        col.init_collective_group(world_size, rank, backend, group_name)
        return True

    def do_allreduce(self, group_name="default", op=ReduceOp.SUM):
        col.allreduce(self.buffer, group_name, op)
        return self.buffer


def create_collective_workers(num_workers=2, group_name="default", backend="nccl"):
    actors = [None] * num_workers
    for i in range(num_workers):
        actor = Worker.remote()
        ray.get([actor.init_tensors.remote()])
        actors[i] = actor
    world_size = num_workers
    init_results = ray.get(
        [
            actor.init_group.remote(world_size, i, backend, group_name)
            for i, actor in enumerate(actors)
        ]
    )
    return actors, init_results


world_size = 2
group_name = "default"
actors, _ = create_collective_workers(
    num_workers=world_size, group_name=group_name, backend=Backend.GLOO
)
results = ray.get([a.do_allreduce.remote(group_name) for a in actors])
I am using:
Ray==2.0.0
Pygloo from source as per (AttributeError: module 'pygloo.rendezvous' has no attribute 'CustomStore' - #4 by matthewdeng)
How could this be?
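For anyone comparing environments, here is a small sketch that prints the installed versions on a node using only the standard library. It assumes the distributions are registered under the names `ray`, `pygloo`, and `numpy`; a source build of pygloo may be registered under a different name or not at all, in which case it shows as not installed. `installed_versions` is a helper name made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names below are assumptions; adjust them to your environment.
    for name, version in installed_versions(["ray", "pygloo", "numpy"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this on both the head node and the worker node would show whether the two machines have the same Ray and pygloo builds.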