Any way to launch MPI programmatically within Ray?

Hi everyone,

I want to ask whether there is any way to launch MPI within Ray actors from Python code; that is, can the Ray actor processes form an MPI communicator without being launched through mpiexec/mpirun?
I found some related issues: [rllib][Ray] Running MPI compliant software as a custom environment together with rllib · Issue #10975 · ray-project/ray · GitHub
and Ray initialization in MPI environment · Issue #7893 · ray-project/ray · GitHub, which seem to suggest it is doable? If so, could I have an example or some guidance on how to do this?
The mpi4py package is probably useful here, but there are limited resources and discussions about this to refer to on the Internet.
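To make the question concrete, the following is roughly what I have in mind (the `Worker` actor is just an illustration; as far as I understand, each actor started this way only gets its own single-process `COMM_WORLD`, which is exactly the problem):

```python
# Illustrative only: what I would like the actors to do. Without mpiexec,
# as far as I understand, each actor sees a singleton COMM_WORLD of size 1
# instead of one communicator shared across all the actors.
import ray
from mpi4py import MPI

@ray.remote
class Worker:
    def rank_and_size(self):
        comm = MPI.COMM_WORLD
        return comm.Get_rank(), comm.Get_size()

ray.init()
workers = [Worker.remote() for _ in range(4)]
print(ray.get([w.rank_and_size.remote() for w in workers]))  # expect (0, 1) x 4
```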
Any thoughts or discussions are welcome.
Thanks so much in advance, and I look forward to your reply!

Hmm, unfortunately this doesn’t seem possible without a large refactoring of the Ray code.

@Kai_Huang why do you want to do this?


Hi Richard,

Thanks for your reply!

For PyTorch RaySGD on CPU, currently only the gloo backend is supported. If I want to use the mpi backend, then the program has to be started with mpirun -n .. instead of as a normal Python script, right (if I’m understanding correctly)? So would it be easy for Ray to support the mpi backend (which supports all-to-all) for PyTorch in RaySGD?
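For reference, my understanding of why the gloo case works is roughly the sketch below (not the actual RaySGD internals; the address and port are made up). gloo can rendezvous over TCP from ordinary processes, whereas the mpi backend expects the MPI runtime to have spawned them:

```python
# Simplified sketch of how a gloo process group can be formed inside Ray
# actors via TCP rendezvous (not the real RaySGD code; address/port made up).
import os
import ray
import torch.distributed as dist

@ray.remote
class TorchWorker:
    def setup(self, rank, world_size, master_addr="127.0.0.1", master_port=29500):
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        # gloo rendezvous works for plain Python processes; no mpirun needed.
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        return dist.get_rank()

ray.init()
workers = [TorchWorker.remote() for _ in range(2)]
print(ray.get([w.setup.remote(i, 2) for i, w in enumerate(workers)]))
```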

Thanks so much!

No, unfortunately it’s not easy to support this. Is there a reason why you need the MPI backend?

Thanks for your reply, Richard!
For models with large embeddings, we probably need not only data parallelism but also model parallelism. In that case, all-to-all is needed during training, and only the mpi backend supports all-to-all for PyTorch… so that’s why I’m thinking of doing this.
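Concretely, the communication pattern I need is along these lines (just a sketch; `exchange_embedding_shards` is an illustrative helper, and the gloo backend does not implement all_to_all, which is the limitation I’m running into):

```python
# Sketch of the all-to-all exchange needed for model-parallel embeddings.
# Each rank holds one shard destined for every other rank and needs one
# shard back from each of them.
import torch
import torch.distributed as dist

def exchange_embedding_shards(local_shards):
    """local_shards[j] is the shard this rank wants to send to rank j."""
    outputs = [torch.empty_like(shard) for shard in local_shards]
    dist.all_to_all(outputs, local_shards)  # needs a backend with all_to_all
    return outputs  # outputs[j] is the shard received from rank j
```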

Hi @Kai_Huang, is your embedding in CPU RAM or GPU memory?

(1) If it is in GPU memory, we might be able to use the existing NCCL-on-Ray send/recv APIs (already supported on Ray master, see here) to assemble an all-to-all interface (that is what PyTorch does internally), which should be easy!

(2) If it is in CPU memory, we might similarly use the WIP GLOO-on-Ray send/recv APIs to assemble an all-to-all for this requirement (see the sketch after this list). But GLOO-on-Ray is still in progress, which might take a while.

(3) The most difficult case is when some of your embeddings are in CPU RAM whereas others are in GPU memory; we might need more work to support this one.
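For (1) and (2), assembling an all-to-all out of the send/recv primitives could look roughly like the sketch below (untested; based on the ray.util.collective API on master, and the exact signatures may differ in your Ray version):

```python
# Untested sketch: building an all-to-all from ray.util.collective send/recv
# inside actors that have already joined a collective group. Each pair
# exchanges with the lower rank sending first so the blocking calls pair up.
import ray.util.collective as col

def naive_all_to_all(send_bufs, recv_bufs, rank, world_size, group="default"):
    # send_bufs[j] is the tensor to send to rank j;
    # recv_bufs[j] is a pre-allocated tensor filled with data from rank j.
    for peer in range(world_size):
        if peer == rank:
            recv_bufs[peer].copy_(send_bufs[peer])
        elif rank < peer:
            col.send(send_bufs[peer], peer, group)
            col.recv(recv_bufs[peer], peer, group)
        else:
            col.recv(recv_bufs[peer], peer, group)
            col.send(send_bufs[peer], peer, group)
    return recv_bufs
```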

Let us know…

Hi Hao,

Thanks for your reply. I’m using purely CPU and thus I suppose should be case 2.
So the WIP GLOO-on-Ray can support all-to-all? May I ask is it easy to integrate with torch.distributed (namely if I have an mpi backend implementation, will it take many efforts to switch to this)?

Thanks,
Kai

Hi @Kai_Huang: yes, the ongoing GLOO-on-Ray can support all-to-all.

If you are using PyTorch, you can choose “mpi” as the backend. You can try to set it up inside a few distributed Ray actors and see whether it works.
The main difficulty for Ray in supporting MPI is MPI’s hard constraint of using an mpiexec command to spawn processes, whereas in Ray the processes are managed by the raylet or core_worker at a lower level. But there is a chance that PyTorch re-implements the collective process rendezvous procedure, see here. Just try each of them when choosing MPI as the backend and see whether it works.
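For concreteness, the experiment I have in mind is roughly the following (untested sketch; it may well fail for the reason above if your MPI runtime insists on mpiexec-spawned processes):

```python
# Untested sketch: try init_process_group("mpi") inside a few Ray actors and
# report whether the MPI runtime lets singleton processes form a group.
import ray
import torch.distributed as dist

@ray.remote
class MPIWorker:
    def try_mpi_backend(self):
        try:
            # The mpi backend derives rank/world_size from the MPI runtime,
            # so no rank or world_size arguments are passed here.
            dist.init_process_group(backend="mpi")
            return dist.get_rank(), dist.get_world_size()
        except Exception as exc:
            return repr(exc)

ray.init()
workers = [MPIWorker.remote() for _ in range(2)]
print(ray.get([w.try_mpi_backend.remote() for w in workers]))
```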

If you’d like to, you can describe your use case (about embedding graidents?) in more detail here so I might help figure out other workarounds for you before the GLOO-on-Ray. In general in distributed ML all-to-all is not a must (there are alternative)…

Hi Hao,

Thanks for your reply. I have already tried init_process_group within Ray actors with the mpi backend and it doesn’t work… I suppose the processes have to be started with mpiexec instead of simply setting some environment variables within a Ray actor.
We are working on end-to-end distributed training for DLRM, and the model might have many very large embeddings, which would make syncing weights a bottleneck. Do you have any ideas on how to improve this without all-to-all in TorchTrainer?

Thanks,
Kai