How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
I see some collective communication verbs in there. It contains send, recv, reduce, etc. These functions look like MPI functions. I searched for them in the Ray project, but they are only used in some test scripts, and they operate on tensors from GPU to GPU.
Are these functions used in Ray's other libraries like RLlib?
If I want to transfer data from node to node, are there functions like ray.util.collective for that? If not, does it mean that transferring data should go through object references?
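For context, here is a minimal sketch of the kind of usage those test scripts suggest. The actor name, group name, and two-GPU setup are my own illustration; it assumes the experimental ray.util.collective API with a NCCL backend and CuPy tensors on the GPUs:

```python
import cupy as cp
import ray
import ray.util.collective as col


@ray.remote(num_gpus=1)
class Worker:
    def __init__(self):
        # One tensor of ones on this actor's GPU.
        self.buffer = cp.ones((10,), dtype=cp.float32)

    def setup(self, world_size, rank):
        # Join a named collective group backed by NCCL.
        col.init_collective_group(world_size, rank, backend="nccl", group_name="default")

    def do_allreduce(self):
        # Sums self.buffer across all ranks in the group, in place.
        col.allreduce(self.buffer, group_name="default")
        return cp.asnumpy(self.buffer)


ray.init()
workers = [Worker.remote() for _ in range(2)]
ray.get([w.setup.remote(2, i) for i, w in enumerate(workers)])
print(ray.get([w.do_allreduce.remote() for w in workers]))  # each rank sees [2., 2., ...]
```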
@xyzyx The Ray collective library is purely experimental, so we haven't used it in any Ray libraries. With that said, today if you want to transfer data you should use the Ray object store.
However, are you running into performance issues with Ray's object-store-based object transfer?
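To make the object-store suggestion concrete, here is a minimal sketch (the task names and array size are just for illustration). If the two tasks land on different nodes, Ray ships the object between their object stores automatically:

```python
import numpy as np
import ray

ray.init(address="auto")  # connect to an existing cluster; plain ray.init() for local testing


@ray.remote
def produce():
    # The returned array is stored in the local node's object store.
    return np.random.rand(1_000_000)


@ray.remote
def consume(arr):
    # If this task is scheduled on another node, Ray transfers the
    # object between object stores before the task body runs.
    return float(arr.sum())


ref = produce.remote()               # an ObjectRef, not the data itself
total = ray.get(consume.remote(ref)) # pass the ref; Ray resolves it for the task
print(total)
```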
Thanks, @Chen_Shen!
Actually, I have access to a cluster with InfiniBand and I'm familiar with MPI. I saw that Ray's code contains collective communication, so I want to make full use of the hardware. After reviewing the code, I think the collective communication focuses on GPU data transfer. Is that true?
As you mentioned, data transfer in Ray is based on the object store. Is it Plasma?
According to Ray's documentation, Ray's control flow and data flow are separate: the control flow is based on gRPC, and the data flow is based on distributed memory management like Plasma. Am I right?
Hi @xyzyx, yes, Ray’s existing collective communication implementation focuses on GPU computation and GPU memory transfer.
Ray's distributed object store is built on top of Ray's per-node Plasma store. Ray's Plasma store is a modified version of Arrow's Plasma store.
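As a small illustration of how the per-node Plasma store shows up in the Python API (my own example, not from this thread): large objects are placed in the node's shared-memory store, and numpy arrays are read back as zero-copy, read-only views:

```python
import numpy as np
import ray

ray.init()

arr = np.zeros(1_000_000)   # ~8 MB, large enough to live in the shared-memory store
ref = ray.put(arr)          # copy the array into the local node's Plasma store
view = ray.get(ref)         # zero-copy view backed by the shared-memory buffer
print(view.flags.writeable) # False: the underlying buffer is read-only
```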
Ray's distributed memory management uses gRPC to transfer data too. There are Ray users who only use Ray's control plane (creating and calling actors) but implement their own data flow between Ray actors (for efficient data transfer between GPUs). We are working on improving Ray's GPU collective communication primitives, so most users can keep using Ray's data flow implementation for efficient GPU computation.
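A rough sketch of that split (the actor and method names are mine, and the group setup mirrors the earlier collective example): the actor calls themselves travel over Ray's gRPC-based control plane, while the tensor moves GPU-to-GPU through the experimental collective layer instead of the object store:

```python
import cupy as cp
import ray
import ray.util.collective as col


@ray.remote(num_gpus=1)
class Peer:
    def __init__(self, rank):
        self.rank = rank
        self.tensor = cp.full((4,), float(rank), dtype=cp.float32)

    def setup(self, world_size):
        # Control plane: this method invocation goes over gRPC.
        col.init_collective_group(world_size, self.rank, backend="nccl", group_name="pair")

    def exchange(self):
        # Data plane: the tensor itself moves GPU-to-GPU via NCCL,
        # bypassing Ray's object store.
        if self.rank == 0:
            col.send(self.tensor, 1, group_name="pair")  # send to rank 1
        else:
            col.recv(self.tensor, 0, group_name="pair")  # receive from rank 0
        return cp.asnumpy(self.tensor)


ray.init()
peers = [Peer.remote(rank) for rank in range(2)]
ray.get([p.setup.remote(2) for p in peers])
print(ray.get([p.exchange.remote() for p in peers]))  # both ranks now hold rank 0's values
```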
As you mentioned, Ray's distributed memory management handles data transfer. If I want to help port Ray's data transfer to InfiniBand, does that mean I need to make gRPC support InfiniBand? Am I right?
Right now Ray focuses on GPU collective communication primitives. Are there any plans for CPU collective communication primitives? If not, does that mean distributed memory management is enough to handle data transfer?
Is there any paper or document that shows how the distributed memory management works?