State and direction of collectives library

Based on a chat with @zhz, posting some basic collective-related questions here:

According to the docs, the collective lib has quite a few gaps regarding various operations. Is there an intent to close them (i.e., is there an intent to position Ray strategically in the MPI space)?

Besides that, the docs claim significantly better performance for the collective lib vs. ‘traditional’ Ray. Do you know what the secret sauce is that makes this so much faster (and if so, why that secret sauce is not / cannot be applied to Ray in general)?


@zhisbug @Stephanie_Wang can probably help here.

There was also this recent SIGCOMM paper: https://dl.acm.org/doi/10.1145/3452296.3472897

Hi @mbehrendt

In terms of the gaps you mentioned in supported ops: according to this table, I wouldn’t say the gaps are large. We support almost all collective and P2P ops on both CPU and GPU, even though we use different backends (GLOO or NCCL) to realize them; that’s why you see, for example, that GLOO does not have a GPU version while NCCL does not have a CPU version.
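To make the backend split concrete, here is a minimal sketch of using the lib with the GLOO (CPU) backend; swapping `backend="gloo"` for `"nccl"` (and the numpy arrays for GPU tensors, e.g. cupy) gives the GPU path. The actor layout, buffer shape, and group name are illustrative, and the GLOO backend assumes pygloo is installed.

```python
import numpy as np
import ray
import ray.util.collective as col

@ray.remote
class Worker:
    def __init__(self, world_size, rank):
        self.world_size = world_size
        self.rank = rank
        self.buf = np.ones((4,), dtype=np.float32)

    def setup(self):
        # All workers join the same named collective group.
        col.init_collective_group(
            self.world_size, self.rank, backend="gloo", group_name="default")
        return True

    def compute(self):
        # In-place allreduce (sum) across the group.
        col.allreduce(self.buf, group_name="default")
        return self.buf

ray.init()
world_size = 2
workers = [Worker.remote(world_size, rank) for rank in range(world_size)]
ray.get([w.setup.remote() for w in workers])       # rendezvous on all ranks
print(ray.get([w.compute.remote() for w in workers]))  # each buffer is all 2s
```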

The tables show that three collectives are missing: gather, scatter, and all-to-all.

For gather/scatter, users can easily compose them from p2p send/recv (which are already supported on both CPUs and GPUs in the current lib); due to the nature of gather/scatter, there would be no performance difference even if we provided an expert implementation.
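For illustration, a gather composed from p2p send/recv might look like the sketch below. It assumes a CPU (GLOO) group is already initialized as above, and the `gather` helper itself is hypothetical, not part of the lib: each non-root rank sends its tensor to the root, and the root receives into preallocated buffers in rank order.

```python
import numpy as np
import ray.util.collective as col

def gather(tensor, root_rank, rank, world_size, group_name="default"):
    """Gather `tensor` from every rank onto `root_rank` (hypothetical helper).

    Returns the list of tensors (in rank order) on the root, None elsewhere.
    All ranks are assumed to pass same-shape, same-dtype tensors.
    """
    if rank != root_rank:
        col.send(tensor, dst_rank=root_rank, group_name=group_name)
        return None
    gathered = []
    for src in range(world_size):
        if src == rank:
            gathered.append(tensor)  # root contributes its own tensor locally
        else:
            buf = np.empty_like(tensor)  # recv needs a preallocated buffer
            col.recv(buf, src_rank=src, group_name=group_name)
            gathered.append(buf)
    return gathered
```

Scatter is the mirror image: the root loops over `send`, and every other rank does a single `recv`.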

Hence, the only missing collective operation is all-to-all (CPU and GPU versions), which we’ll add soon.

For your question about why the performance improves so drastically: the TL;DR is that, at the lower level, these collective operations are better optimized for the specific hardware (e.g., GPUs), communication patterns, and network switches. If you’re interested in knowing more, you might refer to my talk about this lib at Ray Summit.

As for MPI: for now, we do not have a plan to support it, for two reasons:

  • GLOO is a better alternative to MPI and covers 99% of its functionality.
  • The design of MPI (esp. the launch part) has some fundamental discrepancies with Ray; making them compatible is non-trivial. See this RFC for some details.

Hope these answer your questions.

Re:

For gather/scatter, users can easily compose them from p2p send/recv (which are already supported on both CPUs and GPUs in the current lib); due to the nature of gather/scatter, there would be no performance difference even if we provided an expert implementation.

Is there an example available of using p2p send/recv in lieu of gather? I’ve used Python MPI on an HPC before, but I’m new to GLOO and NCCL. Thanks