Any way to launch MPI programmatically within Ray?

Hi everyone,

I want to ask whether there is any way to launch MPI within Ray actors from Python code; that is, can the Ray actor processes form an MPI communicator without being launched through mpiexec/mpirun?
I found some related issues: [rllib][Ray] Running MPI compliant software as a custom environment together with rllib · Issue #10975 · ray-project/ray · GitHub
and Ray initialization in MPI environment · Issue #7893 · ray-project/ray · GitHub, which seem to suggest it is doable? If so, could I have an example or some guidance on how to do this?
The mpi4py package is probably useful here, but there are limited resources and discussions about this to refer to on the Internet.
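To make the question concrete, the following is roughly what I have in mind (the `Worker` actor is just an illustration; as far as I understand, each actor started this way only gets its own single-process `COMM_WORLD`, which is exactly the problem):

```python
# Illustrative only: what I would like the actors to do. Without mpiexec,
# as far as I understand, each actor sees a singleton COMM_WORLD of size 1
# instead of one communicator shared across all the actors.
import ray
from mpi4py import MPI

@ray.remote
class Worker:
    def rank_and_size(self):
        comm = MPI.COMM_WORLD
        return comm.Get_rank(), comm.Get_size()

ray.init()
workers = [Worker.remote() for _ in range(4)]
print(ray.get([w.rank_and_size.remote() for w in workers]))  # expect (0, 1) x 4
```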
Any thoughts or discussions are welcome.
Thanks so much in advance, and I look forward to your reply!

Hmm, unfortunately this doesn’t seem possible without a large refactoring of the Ray code.

@Kai_Huang why do you want to do this?


Hi Richard,

Thanks for your reply!

For PyTorch RaySGD on CPU, currently only the gloo backend is supported. If I want to use the mpi backend, then the program has to be started with mpirun -n .. instead of as a normal Python script, right (if I’m understanding correctly)? So would it be easy for Ray to support the mpi backend (which supports all-to-all) for PyTorch in RaySGD?
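For reference, my understanding of why the gloo case works is roughly the sketch below (not the actual RaySGD internals; the address and port are made up). gloo can rendezvous over TCP from ordinary processes, whereas the mpi backend expects the MPI runtime to have spawned them:

```python
# Simplified sketch of how a gloo process group can be formed inside Ray
# actors via TCP rendezvous (not the real RaySGD code; address/port made up).
import os
import ray
import torch.distributed as dist

@ray.remote
class TorchWorker:
    def setup(self, rank, world_size, master_addr="127.0.0.1", master_port=29500):
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        # gloo rendezvous works for plain Python processes; no mpirun needed.
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        return dist.get_rank()

ray.init()
workers = [TorchWorker.remote() for _ in range(2)]
print(ray.get([w.setup.remote(i, 2) for i, w in enumerate(workers)]))
```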

Thanks so much!

No, unfortunately it’s not easy to support this. Is there a reason why you need the MPI backend?

Thanks for your reply, Richard!
For models with large embeddings, we probably need not only data parallelism but also model parallelism. In that case, all-to-all is needed during training, and only the mpi backend supports all-to-all for PyTorch… so that’s why I’m thinking of doing this.
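Concretely, the communication pattern I need is along these lines (just a sketch; `exchange_embedding_shards` is an illustrative helper, and the gloo backend does not implement all_to_all, which is the limitation I’m running into):

```python
# Sketch of the all-to-all exchange needed for model-parallel embeddings.
# Each rank holds one shard destined for every other rank and needs one
# shard back from each of them.
import torch
import torch.distributed as dist

def exchange_embedding_shards(local_shards):
    """local_shards[j] is the shard this rank wants to send to rank j."""
    outputs = [torch.empty_like(shard) for shard in local_shards]
    dist.all_to_all(outputs, local_shards)  # needs a backend with all_to_all
    return outputs  # outputs[j] is the shard received from rank j
```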

Hi @Kai_Huang, is your embedding in CPU RAM or GPU memory?

(1) If it is in GPU memory, we might be able to use the existing NCCL-on-Ray send/recv APIs (already supported on Ray master, see here) to assemble an all-to-all interface (that is what PyTorch does internally), which should be easy!

(2) If it is in CPU memory, we might similarly use the WIP GLOO-on-Ray send/recv APIs to assemble an all-to-all for this requirement (see the sketch after this list). But GLOO-on-Ray is still in progress, which might take a while.

(3) The most difficult case is when some of your embeddings are in CPU RAM whereas others are in GPU memory; we might need more work to support this one.
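For (1) and (2), assembling an all-to-all out of the send/recv primitives could look roughly like the sketch below (untested; based on the ray.util.collective API on master, and the exact signatures may differ in your Ray version):

```python
# Untested sketch: building an all-to-all from ray.util.collective send/recv
# inside actors that have already joined a collective group. Each pair
# exchanges with the lower rank sending first so the blocking calls pair up.
import ray.util.collective as col

def naive_all_to_all(send_bufs, recv_bufs, rank, world_size, group="default"):
    # send_bufs[j] is the tensor to send to rank j;
    # recv_bufs[j] is a pre-allocated tensor filled with data from rank j.
    for peer in range(world_size):
        if peer == rank:
            recv_bufs[peer].copy_(send_bufs[peer])
        elif rank < peer:
            col.send(send_bufs[peer], peer, group)
            col.recv(recv_bufs[peer], peer, group)
        else:
            col.recv(recv_bufs[peer], peer, group)
            col.send(send_bufs[peer], peer, group)
    return recv_bufs
```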

Let us know…

Hi Hao,

Thanks for your reply. I’m using purely CPU and thus I suppose should be case 2.
So the WIP GLOO-on-Ray can support all-to-all? May I ask is it easy to integrate with torch.distributed (namely if I have an mpi backend implementation, will it take many efforts to switch to this)?

Thanks,
Kai

Hi @Kai_Huang: yes, the ongoing GLOO-on-Ray can support all-to-all.

If you are using PyTorch, you can choose “mpi” as the backend. You can try to set it up inside a few distributed Ray actors and see whether it works.
The main difficulty for Ray in supporting MPI is MPI’s hard constraint of using an mpiexec command to spawn processes, whereas in Ray the processes are managed by the raylet or core_worker at a lower level. But there is a chance that PyTorch re-implements the collective process rendezvous procedure, see here. Just try each of them when choosing MPI as the backend and see whether it works.
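For concreteness, the experiment I have in mind is roughly the following (untested sketch; it may well fail for the reason above if your MPI runtime insists on mpiexec-spawned processes):

```python
# Untested sketch: try init_process_group("mpi") inside a few Ray actors and
# report whether the MPI runtime lets singleton processes form a group.
import ray
import torch.distributed as dist

@ray.remote
class MPIWorker:
    def try_mpi_backend(self):
        try:
            # The mpi backend derives rank/world_size from the MPI runtime,
            # so no rank or world_size arguments are passed here.
            dist.init_process_group(backend="mpi")
            return dist.get_rank(), dist.get_world_size()
        except Exception as exc:
            return repr(exc)

ray.init()
workers = [MPIWorker.remote() for _ in range(2)]
print(ray.get([w.try_mpi_backend.remote() for w in workers]))
```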

If you’d like to, you can describe your use case (about embedding graidents?) in more detail here so I might help figure out other workarounds for you before the GLOO-on-Ray. In general in distributed ML all-to-all is not a must (there are alternative)…

Hi Hao,

Thanks for your reply. I have already tried init_process_group within Ray actors with the mpi backend and it doesn’t work… I suppose the processes have to be started with mpiexec instead of simply setting some environment variables within a Ray actor.
We are working on end-to-end distributed training for DLRM, and the model might have many very large embeddings, which would make syncing weights a bottleneck. Do you have any ideas on how to improve this without all-to-all in TorchTrainer?

Thanks,
Kai