What is ray.util.collective used for?

xyzyx · May 25, 2022, 2:21am

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

I see some collective communication verbs in ray.util.collective, including send, recv, and reduce. These functions look like MPI functions. I searched for them in the Ray project, but they are only used in some test scripts, and they operate on tensors moved from GPU to GPU.

Are these functions used in Ray's other libraries, such as RLlib?
If I want to transfer data from node to node, are there functions like those in ray.util.collective for that? If not, does that mean data should be transferred through object references?

Chen_Shen · May 25, 2022, 11:25pm

@xyzyx The Ray collective library is purely experimental, so we haven't used it in any Ray libraries. With that said, today if you want to transfer data you should use the Ray object store.

However, are you running into performance issues with ray object-store-based object transfer?

xyzyx · May 26, 2022, 1:16am

Thanks, @Chen_Shen !
Actually, I have access to a cluster with InfiniBand and I'm familiar with MPI. I saw that Ray's code contains collective communication, so I want to make full use of the hardware. After reviewing the code, I think the collective communication focuses on GPU data transfer. Is that true?
As you mentioned, data transfer in Ray is based on the object store. Is it Plasma?
According to Ray's documentation, Ray's control flow and data flow are separate. The control flow is based on gRPC and the data flow is based on distributed memory management like Plasma. Am I right?

Mingwei · May 26, 2022, 5:48pm

Hi @xyzyx, yes, Ray’s existing collective communication implementation focuses on GPU computation and GPU memory transfer.

Ray's distributed object store is built on top of Ray's per-node Plasma store. Ray's Plasma store is a modified version of Apache Arrow's Plasma store.

Ray's distributed memory management uses gRPC to transfer data too. Some Ray users only use Ray's control plane (creating and calling actors) but implement their own data flow between Ray actors (for efficient data transfer between GPUs). We are working on improving Ray's GPU collective communication primitives, so that most users can keep using Ray's data flow implementation for efficient GPU computations.

xyzyx · May 27, 2022, 1:44am

Thanks, @Mingwei !

As you mentioned, Ray's distributed memory management controls the data transfer. If I want to help port Ray's data transfer to InfiniBand, does that mean I need to make gRPC support InfiniBand?

Ray currently focuses on GPU collective communication primitives. Are there any plans to work on CPU collective communication primitives? If not, does that mean distributed memory management is enough to handle data transfer?

Is there any paper or document to show how distributed memory management works?
