Just like what UCX can do: for example, NCCL between GPUs, shared memory/IPC between two CPU workers on the same machine, and RDMA/TCP directly (without going through the object store) when they are on different machines.
I know an additional type hint can enable NCCL, but it only supports torch. What about JAX/TensorFlow?
What's more, the transmission is usually N producers to M consumers. How can I build this N-to-M pipe with a DAG or Workflow?
For Compiled Graphs multi-node communication, so far it supports an NCCL transport and a shared-memory + RPC (TCP) based transport, depending on the type hint passed in. As mentioned, we are working on an optimized mechanism.
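Roughly, the transport is chosen per edge via a type hint. Here is a minimal sketch of what that looks like, assuming the `TorchTensorType` / `with_type_hint` API from the Compiled Graphs developer preview (exact names may differ across Ray versions); it also shows one way to wire N producers to M consumers with `MultiOutputNode`, which relates to the earlier N-to-M question:

```python
import torch
import ray
from ray.dag import InputNode, MultiOutputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

@ray.remote(num_gpus=1)
class Producer:
    def produce(self, shape):
        return torch.ones(shape, device="cuda")

@ray.remote(num_gpus=1)
class Consumer:
    def consume(self, *tensors):
        return sum(t.sum().item() for t in tensors)

producers = [Producer.remote() for _ in range(2)]  # N = 2
consumers = [Consumer.remote() for _ in range(3)]  # M = 3

with InputNode() as inp:
    # The type hint on each edge selects the transport: NCCL for
    # GPU-to-GPU tensors, the default shared-memory channel otherwise.
    outs = [
        p.produce.bind(inp).with_type_hint(TorchTensorType(transport="nccl"))
        for p in producers
    ]
    # Each consumer reads every producer's output: an N-to-M pipe.
    dag = MultiOutputNode([c.consume.bind(*outs) for c in consumers])

compiled = dag.experimental_compile()
print(ray.get(compiled.execute((4, 4))))
```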
We can use async APIs to overlap compute/comm. We are already doing something similar with NCCL (which is also collective).
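For reference, a rough sketch of what that overlap could look like with the async API, assuming `execute_async` behaves as in the Compiled Graphs examples (submission returns a future that is awaited separately):

```python
import asyncio

async def run_pipelined(compiled_dag, batches):
    # Submit every batch up front; the graph can then transfer batch
    # i+1 over NCCL while batch i is still computing downstream.
    futs = [await compiled_dag.execute_async(b) for b in batches]
    # Await results only after all submissions are in flight.
    return [await fut for fut in futs]

# results = asyncio.run(run_pipelined(compiled, batches))
```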
Also, the design is not final. We are open to hearing any recommendations you have for an active-message-style, high-performance communication library that's easy to support!
Actually, we're doing something similar. To avoid duplicating the work, maybe we can get involved in this in some way? In addition to LLMs, we focus more on recommender-system applications.
Are there any early open versions? Why choose Gloo? And what's the key difference between Compiled Graphs and Alpa (an MPMD JAX framework on Ray)?
I think Alpa could be implemented using Compiled Graphs, so Compiled Graphs is more like a lower-level abstraction for expressing a task graph with GPU-to-GPU communication (and it is more optimized).
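To make that comparison concrete, here is a hedged sketch of expressing Alpa-style inter-operator (pipeline) parallelism as a compiled task graph, with NCCL edges between GPU stages. The `Stage` actor and its `forward` method are hypothetical placeholders, not an API from either project:

```python
import ray
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

@ray.remote(num_gpus=1)
class Stage:
    # Hypothetical pipeline stage holding one model shard.
    def forward(self, x):
        return x  # this shard's layers would run here

stages = [Stage.remote() for _ in range(4)]

with InputNode() as inp:
    x = stages[0].forward.bind(inp)
    for stage in stages[1:]:
        # Each inter-stage edge is a direct GPU-to-GPU NCCL send,
        # the communication pattern an Alpa-style pipeline needs.
        x = x.with_type_hint(TorchTensorType(transport="nccl"))
        x = stage.forward.bind(x)

compiled = x.experimental_compile()
```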
btw, I may be missing some context. What are you guys working on?