Will the Ray DAG/Workflows choose the best transport for data between two nodes?

Just like what UCX can do.
For example, NCCL between GPUs, shared memory/IPC between two CPU workers on the same machine, and RDMA/TCP directly (without going through the object store) between workers on different machines.
I know an additional type hint can use NCCL, but it only supports Torch. What about JAX/TensorFlow?

What’s more, the transmission is usually N producers to M consumers. How can I build this N-to-M pipe with a DAG or Workflow?
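To make the N-to-M question concrete, here is a minimal sketch of the pattern outside of Ray, using a single shared `asyncio.Queue` as the pipe; the function and parameter names (`producer`, `consumer`, `n_to_m_pipe`) are hypothetical, and a Ray DAG would need an equivalent fan-out/fan-in primitive:

```python
import asyncio

async def producer(q: asyncio.Queue, pid: int, n_items: int) -> None:
    # Each of the N producers pushes its items into the shared pipe.
    for i in range(n_items):
        await q.put((pid, i))

async def consumer(q: asyncio.Queue, out: list) -> None:
    # Each of the M consumers pulls whatever item arrives next.
    while True:
        item = await q.get()
        if item is None:  # sentinel: shut this consumer down
            break
        out.append(item)

async def n_to_m_pipe(n_producers: int = 3, m_consumers: int = 2,
                      items_per_producer: int = 4) -> list:
    q: asyncio.Queue = asyncio.Queue()
    out: list = []
    consumers = [asyncio.create_task(consumer(q, out))
                 for _ in range(m_consumers)]
    await asyncio.gather(*(producer(q, pid, items_per_producer)
                           for pid in range(n_producers)))
    for _ in range(m_consumers):
        await q.put(None)  # one sentinel per consumer
    await asyncio.gather(*consumers)
    return out

results = asyncio.run(n_to_m_pipe())
print(len(results))  # 3 producers x 4 items = 12
```

The point of the sketch is only that an N-to-M pipe needs a shared, many-to-many channel, which plain task-to-task DAG edges do not express directly.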

Default Ray APIs don’t support it.

We are building a new API called Ray Compiled Graphs which is going to support something similar. But it is starting with:

  • GPU-to-GPU transfer over NCCL when the user annotates it
  • shared memory for local (same-node) transfer
  • a higher-performance multi-node transport, coming soon
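For reference, the NCCL annotation mentioned above looks roughly like the following. This is a hedged sketch assuming Ray's experimental compiled-graph API (`ray.dag.InputNode`, `TorchTensorType`, `experimental_compile`); the actor methods `make_tensor`/`consume` are hypothetical, and exact names may change between releases:

```python
# Sketch only: requires a Ray installation and two GPU actors.
try:
    import ray
    from ray.dag import InputNode
    from ray.experimental.channel.torch_tensor_type import TorchTensorType
    HAVE_RAY = True
except ImportError:
    HAVE_RAY = False

def build_gpu_dag(sender, receiver):
    """Compile a two-actor DAG whose edge is annotated for NCCL transport."""
    with InputNode() as inp:
        t = sender.make_tensor.bind(inp)
        # The type hint tells the compiled graph to move this tensor
        # GPU-to-GPU over NCCL instead of the shared-memory object store.
        t = t.with_type_hint(TorchTensorType(transport="nccl"))
        dag = receiver.consume.bind(t)
    return dag.experimental_compile()
```

Without the type hint, the same edge falls back to the default shared-memory channel.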

Thank you for your reply. May I ask if the multi-node transport DAG has a design draft?

It will be posted to the Compiled Graphs channel. But the simple idea is to just use Gloo under the hood.

For Compiled Graphs multi-node communication: so far it supports an NCCL transport (over TCP) and a shared-memory + RPC based transport, depending on the type hint passed in. As mentioned, we are working on an optimized mechanism.

Gloo is a collective communication library. Its synchronization mechanism may cause some unnecessary overhead between different tasks.

We can use async APIs to overlap compute and communication. We are already doing something similar with NCCL (which is also collective).

Also, the design is not final. We are open to hearing any recommendations for an active-message-style, high-performance communication library that’s easy to support!


Actually, we’re doing something similar. To avoid duplicating work, maybe we can get involved in this in some way? In addition to LLMs, we focus more on applications in recommender systems.
Are there any early open versions? Why choose Gloo? And what’s the key difference between Compiled Graphs and Alpa (an MPMD JAX framework on Ray)?

By the way, here is a comparison between UCX and Gloo: https://www.researchgate.net/publication/367280846_Supercharging_Distributed_Computing_Environments_For_High_Performance_Data_Engineering/figures?lo=1

I think Alpa could be implemented on top of Compiled Graphs, so Compiled Graphs is more like a lower-level abstraction for expressing a task graph with GPU-to-GPU communication (and more optimized).


By the way, I may be missing some context. What are you guys working on?

Building something like Pathways with Ray: an MPMD framework where one controller assigns work to different SPMD meshes (for example, JAX, Megatron).

Oh, interesting. Maybe we can talk in person? The motivation for this feature is very similar.