Using Ray over InfiniBand

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
    I use a Slurm script to run Ray on a cluster that has InfiniBand. InfiniBand is a channel-based fabric that facilitates high-speed communication between interconnected nodes. I know Ray is built on gRPC, which uses TCP/IP. But can Ray run over InfiniBand? Is there any plan to support using Ray over InfiniBand?

No plans for the foreseeable future! But I can imagine some paths that could be beneficial (like object transfer).

Hey @Chen_Shen aren’t we planning to support this?

This mainly just requires the ability to specify a particular network card.

hey @xyzyx do you have a setup where you want to use InfiniBand?
Currently, Ray doesn’t support InfiniBand, mainly because there is not yet a standardized, easy-to-use API for it, compared to Ethernet. However, you might still be able to use it via Ethernet over InfiniBand.

That said, we’d be happy to learn more about your use case and explore this option.

I’m using the IB network interface (instead of the default) to communicate between the Ray workers in a Slurm cluster. I imagined that would make the communication faster. Isn’t that the case?
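For anyone trying the same thing, here is a minimal sketch of how the IB interface's address could be picked up and handed to `ray start` from a Slurm job. It assumes a Linux node with the `ip` tool, an IP-over-InfiniBand interface named `ib0` (the name is an assumption; check yours with `ip addr`), and port 6379. The helper names are made up for illustration. Note that this only points Ray's TCP/IP traffic at the IB interface; it does not use RDMA.

```python
import re
import shlex
import subprocess
from typing import Optional


def parse_ipv4(ip_addr_output: str) -> Optional[str]:
    """Return the first IPv4 address in `ip -o -4 addr show dev <iface>` output."""
    match = re.search(r"\binet\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+", ip_addr_output)
    return match.group(1) if match else None


def interface_ipv4(iface: str = "ib0") -> Optional[str]:
    """Look up the IPv4 address of an interface (Linux only; "ib0" is an assumed name)."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show", "dev", iface],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_ipv4(result.stdout)


def ray_head_command(node_ip: str, port: int = 6379) -> str:
    """Build the command that starts the Ray head node on the given address."""
    return shlex.join(
        ["ray", "start", "--head", f"--node-ip-address={node_ip}", f"--port={port}"]
    )


def ray_worker_command(head_ip: str, node_ip: str, port: int = 6379) -> str:
    """Build the command that starts a worker and connects it to the head node."""
    return shlex.join(
        ["ray", "start", f"--address={head_ip}:{port}", f"--node-ip-address={node_ip}"]
    )


if __name__ == "__main__":
    ip = interface_ipv4("ib0")
    if ip is None:
        print("No IPv4 address found on ib0")
    else:
        print(ray_head_command(ip))
```

On the head node you would run the printed command; on each worker, build the command with `ray_worker_command(head_ip, interface_ipv4("ib0"))`. Whether this actually speeds things up depends on the workload, so it is worth measuring against the default interface.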

@vakker00 yeah, if it’s Ethernet over the IB network, support for that is planned: [Feature] [core] Selecting network interface · Issue #22732 · ray-project/ray · GitHub

Hello @rliaw. InfiniBand support would be great. Instead of supporting the verbs API directly, people usually target UCX (Dask, for instance, has a backend using ucx-py) or libfabric (on AWS, for instance, it is the way to use the EFA networking stack for both MPI and NCCL applications). There are some projects, like https://mercury-hpc.github.io/, that try to provide an RPC interface that supports this kind of hardware, and the tensorpipe library, which also supports this kind of interface but is focused on point-to-point communication of tensors. However, most of these libraries are somewhat lower level than gRPC.

Yes, using the IB network can make communication faster. But some features, like RDMA, are not fully used. I wonder if Ray could use these features to speed up communication.