Will the Ray DAG/Workflows choose the best transport for data between two nodes?

Just like what UCX can do.
For example, NCCL between GPUs, shared memory/IPC between two CPU workers on the same machine, and RDMA/TCP directly (without going through the object store) between workers on different machines.
I know an additional type hint can use NCCL, but it only supports Torch. What about JAX/TensorFlow?

What’s more, the transmission is usually N producers to M consumers. How can I build this N-to-M pipe with a DAG or Workflow?
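To make the N-to-M question concrete, here is a minimal sketch of the pattern outside of Ray, using a single shared `asyncio.Queue` as the pipe; the function and parameter names (`producer`, `consumer`, `n_to_m_pipe`) are hypothetical, and a Ray DAG would need an equivalent fan-out/fan-in primitive:

```python
import asyncio

async def producer(q: asyncio.Queue, pid: int, n_items: int) -> None:
    # Each of the N producers pushes its items into the shared pipe.
    for i in range(n_items):
        await q.put((pid, i))

async def consumer(q: asyncio.Queue, out: list) -> None:
    # Each of the M consumers pulls whatever item arrives next.
    while True:
        item = await q.get()
        if item is None:  # sentinel: shut this consumer down
            break
        out.append(item)

async def n_to_m_pipe(n_producers: int = 3, m_consumers: int = 2,
                      items_per_producer: int = 4) -> list:
    q: asyncio.Queue = asyncio.Queue()
    out: list = []
    consumers = [asyncio.create_task(consumer(q, out))
                 for _ in range(m_consumers)]
    await asyncio.gather(*(producer(q, pid, items_per_producer)
                           for pid in range(n_producers)))
    for _ in range(m_consumers):
        await q.put(None)  # one sentinel per consumer
    await asyncio.gather(*consumers)
    return out

results = asyncio.run(n_to_m_pipe())
print(len(results))  # 3 producers x 4 items = 12
```

The point of the sketch is only that an N-to-M pipe needs a shared, many-to-many channel, which plain task-to-task DAG edges do not express directly.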

Default Ray APIs don’t support it.

We are building a new API called Ray Compiled Graphs which is going to support something similar. But it is starting with:

  • GPU-to-GPU transfer over NCCL when the user annotates it
  • shared memory for local (same-node) transfer
  • a higher-performance multi-node transport, coming soon
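For reference, the NCCL annotation mentioned above looks roughly like the following. This is a hedged sketch assuming Ray's experimental compiled-graph API (`ray.dag.InputNode`, `TorchTensorType`, `experimental_compile`); the actor methods `make_tensor`/`consume` are hypothetical, and exact names may change between releases:

```python
# Sketch only: requires a Ray installation and two GPU actors.
try:
    import ray
    from ray.dag import InputNode
    from ray.experimental.channel.torch_tensor_type import TorchTensorType
    HAVE_RAY = True
except ImportError:
    HAVE_RAY = False

def build_gpu_dag(sender, receiver):
    """Compile a two-actor DAG whose edge is annotated for NCCL transport."""
    with InputNode() as inp:
        t = sender.make_tensor.bind(inp)
        # The type hint tells the compiled graph to move this tensor
        # GPU-to-GPU over NCCL instead of the shared-memory object store.
        t = t.with_type_hint(TorchTensorType(transport="nccl"))
        dag = receiver.consume.bind(t)
    return dag.experimental_compile()
```

Without the type hint, the same edge falls back to the default shared-memory channel.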

Thank you for your reply. May I ask if the multi-node transport DAG has a design draft?

It will be posted to the Compiled Graphs channel. But the simple idea is to just use Gloo under the hood.

For Compiled Graphs multi-node communication: so far it supports an NCCL transport (over TCP) and a shared-memory + RPC based transport, depending on the type hint passed in. As mentioned, we are working on an optimized mechanism.

Gloo is a collective communication library. Its synchronization mechanism may cause some unnecessary overhead between different tasks.

We can use async APIs to overlap compute and communication. We are already doing something similar with NCCL (which is also collective).

Also, the design is not final. We are open to hearing any recommendations for an active-message-style, high-performance communication library that’s easy to support!


Actually, we’re doing something similar. To avoid duplicating work, maybe we can get involved in this in some way? In addition to LLMs, we focus more on applications in recommender systems.
Are there any early open versions? Why choose Gloo? And what’s the key difference between Compiled Graphs and Alpa (an MPMD JAX framework on Ray)?

By the way, here is a comparison between UCX and Gloo: https://www.researchgate.net/publication/367280846_Supercharging_Distributed_Computing_Environments_For_High_Performance_Data_Engineering/figures?lo=1

I think Alpa could be implemented on top of Compiled Graphs, so Compiled Graphs is more like a lower-level abstraction for expressing a task graph with GPU-to-GPU communication (and more optimized).


By the way, I may be missing some context. What are you guys working on?

Building something like Pathways with Ray: an MPMD framework where one controller assigns work to different SPMD meshes (for example, JAX, Megatron).

Oh, interesting. Maybe we can talk in person? The motivation for this feature is very similar.