[Ray Serve] how to serve large models?

I’m trying to serve a composition model with large weights stored in NumPy arrays. I ran into the error below, which may be a bug; see ray-project/ray issue #32049 on GitHub, "[Serve] ValueError: Message ray.serve.ReplicaConfig exceeds maximum protobuf size of 2GB".

ValueError: Message ray.serve.ReplicaConfig exceeds maximum protobuf size of 2GB: 3200001162
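For anyone hitting the same error, a quick way to tell in advance whether a set of weights would blow the limit is to add up the array sizes before handing them to a deployment. This is only a rough sketch: the cap is assumed to be protobuf's usual 2^31 - 1 bytes, the helper names are made up for illustration, and the 800M-element float32 array is a hypothetical layout, not the actual model from this thread.

```python
import math

import numpy as np

# Assumed cap: protobuf messages are limited to 2^31 - 1 bytes.
PROTOBUF_MAX_BYTES = 2**31 - 1


def array_nbytes(shape, dtype):
    """Size in bytes of an array with this shape and dtype, without allocating it."""
    return math.prod(shape) * np.dtype(dtype).itemsize


def fits_in_protobuf(shapes_and_dtypes):
    """Return (total_bytes, fits) for a list of (shape, dtype) pairs."""
    total = sum(array_nbytes(shape, dtype) for shape, dtype in shapes_and_dtypes)
    return total, total <= PROTOBUF_MAX_BYTES


if __name__ == "__main__":
    # Hypothetical layout: a single float32 array with 800 million entries.
    total, fits = fits_in_protobuf([((800_000_000,), np.float32)])
    print(f"{total} bytes, fits in one message: {fits}")
```

If the total is anywhere near the cap, the weights probably should not travel inside the deployment's constructor arguments at all.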

In any case, how do you solve this if your weights are really big?

cc: @Sihan_Wang for thoughts

The 2 GB limit seems at odds with the Ray project's philosophy and goals. It’s not uncommon in NLP to serve models with large weights, and 2 GB isn’t really that much.

Maybe I’m missing something, but the project docs state that the Serve/actor state “can have a very large neural network weight.”

One way to work around this problem is to start an actor on the cluster that holds the model (plus one more process that continuously trains the weights), and then pass a handle to that actor to the Serve deployment. It’s not exactly the right way, but it works, and it can be organised into separate steps so that large weights never have to be saved to external storage.

You might want to consider storing the model in the object store and having the actors load it from there.
This scheme is covered in a Ray Summit 2022 talk (see the Ray Summit 2022 agenda) and in the equivalent blog post, "How to Load PyTorch Models 340 Times Faster with Ray" by Fred Reiss (IBM Data Science in Practice, Medium).

Also, we gave a Ray meetup talk comparing different schemes for loading and serving large models. See if that helps.

Efforts and discussions are underway to deal specifically with LLMs in Ray Serve.

@Sihan_Wang @cindy_zhang can we get a section in the Serve documentation about best practices here?