[Ray Serve] how to serve large models?

I’m trying to serve a composition model whose large weights are stored in NumPy arrays. I ran into this issue, which may be a bug: [Serve] ValueError: Message ray.serve.ReplicaConfig exceeds maximum protobuf size of 2GB · Issue #32049 · ray-project/ray · GitHub

ValueError: Message ray.serve.ReplicaConfig exceeds maximum protobuf size of 2GB: 3200001162

In any case, how do you solve this if your weights are really big?
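For context, here’s a minimal sketch of the kind of code that triggers it (the `Predictor` class and array sizes here are illustrative, not my actual model):

```python
import numpy as np
from ray import serve


@serve.deployment
class Predictor:
    def __init__(self, weights):
        self.weights = weights

    async def __call__(self, request):
        x = np.asarray(await request.json(), dtype=np.float32)
        return (x @ self.weights).tolist()


# ~3.2 GB of float32 weights passed directly as an init argument get
# serialized into the deployment's ReplicaConfig protobuf, which is
# capped at 2 GB, hence the ValueError above.
weights = np.zeros((20_000, 40_000), dtype=np.float32)
serve.run(Predictor.bind(weights))
```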

cc: @Sihan_Wang for thoughts

The 2 GB limit seems at odds with the Ray project’s philosophy and goals. It’s not uncommon in NLP to serve models with large weights, and 2 GB isn’t really that much.

Maybe I’m missing something? But the project docs state that the serve/actor state “can have a very large neural network weight.”

One way to solve this is to start an actor that holds the model on the cluster (plus one more process that continuously trains the weights) and then pass a handle to that actor into the Serve deployment. It’s not exactly the right way, but it works, and it can be organised into separate steps so the large weights never have to be saved to external storage.
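A rough sketch of that pattern (`ModelHolder` and `Predictor` are illustrative names, not a fixed API):

```python
import numpy as np
import ray
from ray import serve


# Long-lived actor that owns the weights; a separate training process
# can keep calling update_weights on it.
@ray.remote
class ModelHolder:
    def __init__(self):
        self.weights = np.zeros((20_000, 40_000), dtype=np.float32)

    def update_weights(self, new_weights):
        self.weights = new_weights

    def predict(self, x):
        return x @ self.weights


@serve.deployment
class Predictor:
    def __init__(self, model_holder):
        # Only the small actor handle ends up in the ReplicaConfig,
        # never the weights themselves.
        self.model_holder = model_holder

    async def __call__(self, request):
        x = np.asarray(await request.json(), dtype=np.float32)
        result = await self.model_holder.predict.remote(x)
        return result.tolist()


holder = ModelHolder.options(name="model_holder", lifetime="detached").remote()
serve.run(Predictor.bind(holder))
```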

You might want to consider storing the model in the Ray object store and having the replicas load it from there.
This talk speaks to that scheme: Ray Summit 2022 - Agenda, and the equivalent blog post: How to Load PyTorch Models 340 Times Faster with Ray | by Fred Reiss | IBM Data Science in Practice | Medium
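A minimal sketch of that approach, assuming the ObjectRef is wrapped in a list so Ray passes the reference through to the replicas unresolved instead of inlining the array:

```python
import numpy as np
import ray
from ray import serve


# Put the large weights into Ray's shared-memory object store once.
weights_ref = ray.put(np.zeros((20_000, 40_000), dtype=np.float32))


@serve.deployment(num_replicas=2)
class Predictor:
    def __init__(self, weights_ref_wrapper):
        # Each replica fetches the weights from the object store here,
        # so they never pass through the protobuf-limited ReplicaConfig.
        # On the same node, ray.get returns a zero-copy read-only view.
        self.weights = ray.get(weights_ref_wrapper[0])

    async def __call__(self, request):
        x = np.asarray(await request.json(), dtype=np.float32)
        return (x @ self.weights).tolist()


serve.run(Predictor.bind([weights_ref]))
```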

Also, we did a Ray meetup talk comparing different schemes for loading and serving large models. See if that helps at all.

Efforts and discussions are also underway to deal specifically with LLMs in Ray Serve.


@Sihan_Wang @cindy_zhang can we get a section in the Serve documentation about best practices here?

Hi @rliaw, are there any updated best practices? I am running into similar issues when trying to deploy a 10GB model.