AttributeError: 'RayInternalKvStore' object has no attribute 'del_keys'

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi! I am trying to use the Ray Collective Communication Library for communication between distributed CPUs, with gloo as the backend. I get the following error when I run my script.

NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2022-10-02 13:40:43,958	INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.29.58.27:6379...
2022-10-02 13:40:43,963	INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(pid=8135) 2022-10-02 13:40:45,050	WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2022-10-02 13:40:45,117	WARNING worker.py:1829 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="a2d0ba68-b14a-4cc4-8fcc-f383b869e3ed", ...)
2022-10-02 13:40:45,124	WARNING worker.py:1829 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="a2d0ba68-b14a-4cc4-8fcc-f383b869e3ed", ...)
(pid=8137) 2022-10-02 13:40:45,119	WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
(pid=37650, ip=172.29.58.192) 2022-10-02 13:40:46,116	WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
Traceback (most recent call last):
  File "demo_collective_communication_all_reduce.py", line 32, in <module>
    _ = ray.get(init_rets)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2275, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::Worker.setup() (pid=8135, ip=172.29.58.27, repr=<demo_collective_communication_all_reduce.Worker object at 0x7f69ead45520>)
  File "demo_collective_communication_all_reduce.py", line 14, in setup
    collective.init_collective_group(world_size, rank, "gloo", "default")
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective.py", line 148, in init_collective_group
    _group_mgr.create_collective_group(backend, world_size, rank, group_name)
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective.py", line 63, in create_collective_group
    g = GLOOGroup(
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 209, in __init__
    self._rendezvous.meet()
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 158, in meet
    self._store.delKeys(keys)
AttributeError: 'RayInternalKvStore' object has no attribute 'del_keys'

This is the script I am running.

import numpy as np

import ray
import ray.util.collective as col
from ray.util.collective.types import Backend, ReduceOp

import torch


@ray.remote(num_cpus=4)
class Worker:
    def __init__(self):
        self.buffer = None
        self.list_buffer = None

    def init_tensors(self):
        self.buffer = np.ones((10,), dtype=np.float32)
        self.list_buffer = [np.ones((10,), dtype=np.float32) for _ in range(2)]
        return True

    def init_group(self, world_size, rank, backend=Backend.NCCL, group_name="default"):
        col.init_collective_group(world_size, rank, backend, group_name)
        return True

    def do_allreduce(self, group_name="default", op=ReduceOp.SUM):
        col.allreduce(self.buffer, group_name, op)
        return self.buffer



def create_collective_workers(num_workers=2, group_name="default", backend="nccl"):
    actors = [None] * num_workers
    for i in range(num_workers):
        actor = Worker.remote()
        ray.get([actor.init_tensors.remote()])
        actors[i] = actor
    world_size = num_workers
    init_results = ray.get(
        [
            actor.init_group.remote(world_size, i, backend, group_name)
            for i, actor in enumerate(actors)
        ]
    )
    return actors, init_results

world_size = 2
group_name = "default"
actors, _ = create_collective_workers(
    num_workers=world_size, group_name=group_name, backend=Backend.GLOO
)
results = ray.get([a.do_allreduce.remote(group_name) for a in actors])

I am using:
Ray==2.0.0
Pygloo built from source, as described in the thread "AttributeError: module 'pygloo.rendezvous' has no attribute 'CustomStore'" (post #4 by matthewdeng)

How could this be?
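
A mismatch like this can come from the installed packages being out of step with each other, so it may help to confirm which versions the failing environment actually has. The sketch below uses only the standard library; the helper name installed_version is hypothetical, and it assumes the distributions are registered under the names "ray" and "pygloo" (a source build may register differently).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent.

    Hypothetical helper for illustration; not part of Ray or pygloo.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust if your build differs.
    for name in ("ray", "pygloo"):
        print(name, installed_version(name))
```

Running this on every node of the cluster, not only the head node, should show whether the workers share the same Ray and pygloo builds.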

Fix added: "fix missing APIs in gloo kvstore" by jiaodong (Pull Request #29084, ray-project/ray on GitHub).
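
To illustrate the failure mode in the traceback, here is a minimal, self-contained sketch. It does not reproduce Ray's RayInternalKvStore or the contents of the pull request; the class names StoreWithoutDelete and StoreWithDelete, the function clear_rendezvous_keys, and the return values are all hypothetical. It only shows that a caller which expects a del_keys method raises AttributeError on a store that lacks it, and works once the method exists.

```python
class StoreWithoutDelete:
    """Hypothetical in-memory key-value store with no deletion methods."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class StoreWithDelete(StoreWithoutDelete):
    """Same hypothetical store, with the deletion methods the caller expects."""

    def del_key(self, key):
        # Return True if the key existed and was removed, False otherwise.
        if key in self._data:
            del self._data[key]
            return True
        return False

    def del_keys(self, keys):
        return [self.del_key(key) for key in keys]


def clear_rendezvous_keys(store, keys):
    """Stand-in for a caller that assumes the store exposes del_keys."""
    return store.del_keys(keys)


if __name__ == "__main__":
    try:
        clear_rendezvous_keys(StoreWithoutDelete(), ["rank_0"])
    except AttributeError as exc:
        print("store without del_keys fails:", exc)

    store = StoreWithDelete()
    store.set("rank_0", b"addr")
    print("store with del_keys:", clear_rendezvous_keys(store, ["rank_0"]))
```

In practice, the resolution for this thread is to use a Ray build that includes the pull request above, not to patch the store locally.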