Model Parallelism in Ray

Vipul_Gupta · January 7, 2022, 6:57pm

Hi folks, it seems like Ray Train focuses on distributed training with data parallelism. I am wondering if there is a use case with model parallelism. In our specific use-case, we are training large-scale embeddings, and these typically require model parallelism due to a large embedding matrix that cannot fit in the memory of one machine.

rliaw · January 7, 2022, 7:09pm

That’s a great use case. You mentioned offline that you were looking at PyTorch-BigGraph in particular?

Vipul_Gupta · January 8, 2022, 6:28am

Thanks, Richard, for the quick response. Yes, in particular, we are looking to train node embeddings on large graphs, using PyTorch-BigGraph as the training framework on a Ray cluster.

rliaw · January 11, 2022, 3:29am

Officially, we don’t have any pre-existing examples. However, it should work fine (given that Ray Train just constructs the process group for you).

We would be happy to help guide you through the implementation, if you have any particular questions.

rliaw · January 11, 2022, 3:29am

(and also subsequently highlight your use case as a successful example down the road!)

Vipul_Gupta · January 13, 2022, 7:16pm

Of course! I would be happy to contribute to the Ray ecosystem in any way I can.

The fact that Ray Train just constructs the process group makes sense. However, the examples in the documentation give the impression that the API is limited and works mostly for the data-parallel case. After digging into PyTorch-BigGraph, I think I can train with model parallelism on Ray if I can replace init_process_group from torch.distributed with an equivalent Ray function. Is there a similar API in Ray? Thanks!
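A minimal sketch of how this could look, not taken from the thread or from PyTorch-BigGraph. It assumes a Ray 2.7 or later release where `TorchTrainer` and `ScalingConfig` are importable as shown, and it relies on Ray Train setting up the torch.distributed process group on each worker before the training function runs, so the function never calls init_process_group itself. The helper `shard_bounds`, the config keys `num_rows` and `dim`, and the table size are hypothetical names and values chosen for illustration.

```python
def shard_bounds(num_rows: int, rank: int, world_size: int) -> tuple[int, int]:
    """Return the [start, end) range of embedding rows owned by `rank`.

    Hypothetical helper: rows are split as evenly as possible, with the
    first `num_rows % world_size` ranks taking one extra row each.
    """
    base, extra = divmod(num_rows, world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def train_loop_per_worker(config: dict) -> None:
    # Imported lazily so shard_bounds can be used without torch installed.
    import torch
    import torch.distributed as dist

    # No init_process_group call here: Ray Train is assumed to have created
    # the process group for this worker before invoking this function.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    start, end = shard_bounds(config["num_rows"], rank, world_size)

    # Each worker allocates only its own slice of the embedding table.
    shard = torch.nn.Embedding(end - start, config["dim"])
    print(f"rank {rank}/{world_size} holds rows [{start}, {end}): {shard}")


def main() -> None:
    # Requires ray[train] and torch; the values below are placeholders.
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"num_rows": 1_000_000, "dim": 128},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    )
    trainer.fit()


# Stdlib-only demonstration of the row partitioning for 10 rows on 3 workers.
for rank in range(3):
    print(rank, shard_bounds(10, rank, 3))
```

Calling `main()` on a machine with Ray and PyTorch installed would launch two workers that each build their own shard. Lookups for rows owned by another worker, and the corresponding gradient exchange, are left out; they would use the usual torch.distributed collectives on top of the process group.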

rliaw · January 15, 2022, 5:13am

Can you post an example of what you want to do?

jamjambles · October 28, 2023, 3:23am

Hi all,

I’ve come across a similar use case: we want to use Ray Train to split a large model across multiple GPUs rather than replicate it on each one (data parallel).

For example, I have a cluster of A10 GPUs (24GB each), but the model requires ~50GB of GPU memory. Can I use Ray Train APIs to partition the model across multiple GPUs? Are there any examples?

Thanks!
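As a rough back-of-the-envelope check on the numbers in this question, the sketch below computes the smallest number of GPUs whose combined memory covers the model. It counts only the stated ~50GB footprint against 24GB per GPU; the function name `min_gpus_needed` and the `usable_fraction` parameter are hypothetical, and real jobs would need extra headroom for activations, optimizer state, and framework overhead.

```python
import math


def min_gpus_needed(model_gb: float, gpu_gb: float, usable_fraction: float = 1.0) -> int:
    """Lower bound on the GPU count needed to hold `model_gb` in total.

    `usable_fraction` is an illustrative knob for reserving part of each
    GPU's memory for anything that is not the model itself.
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return math.ceil(model_gb / (gpu_gb * usable_fraction))


print(min_gpus_needed(50, 24))                       # 3
print(min_gpus_needed(50, 24, usable_fraction=0.8))  # 3
```

So at least three A10s are needed before any partitioning scheme can work, and two of them together (48GB) fall just short.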

justinvyu · November 2, 2023, 9:27pm

Hey @jamjambles,

The easiest way would be to use DeepSpeed with a ZeRO stage 1, 2, or 3 sharding strategy through Ray Train. See the "Get Started with DeepSpeed" user guide in the Ray 2.7.1 documentation for more details.

Let me know if you’re able to set that up. I’m also happy to move to a Slack conversation to discuss further!
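For reference, the ZeRO stage is selected in the DeepSpeed configuration under the `zero_optimization` key. Stage 1 partitions optimizer states across workers, stage 2 also partitions gradients, and stage 3 also partitions the model parameters, so stage 3 is the one that helps when the parameters themselves do not fit on a single GPU. The sketch below only builds such a configuration as a Python dictionary; the helper name `make_zero_config` and the batch size of 1 are placeholders, not values from the thread.

```python
import json


def make_zero_config(stage: int, micro_batch_size: int = 1) -> dict:
    """Build a minimal DeepSpeed config dict for the given ZeRO stage.

    Hypothetical helper; a real config would usually also set precision,
    optimizer, and scheduler options.
    """
    if stage not in (1, 2, 3):
        raise ValueError("ZeRO sharding stage must be 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "zero_optimization": {"stage": stage},
    }


# Stage 3 shards parameters as well as gradients and optimizer states.
print(json.dumps(make_zero_config(3), indent=2))
```

The resulting dictionary would typically be passed as the `config` argument of `deepspeed.initialize` inside the per-worker training function that Ray Train runs.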

Magic_Liu_Liu · November 18, 2023, 3:49pm

@rliaw can you help me? I just want to use Ray to serve Llama-2-70B on 2 VMs, each with 4 A10 GPUs.
