Currently I am using FastAPI to serve models.
The architecture is pretty simple.
When the FastAPI app starts up, it creates model objects by looking for all models marked as active in a config file.
Each model object is basically a class containing methods for loading the model and performing inference.
Each model object is stored in a map keyed by a string (a unique model identifier).
During inference, the model id is used to look up the model object in the map, and that model's infer method is called.
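Roughly, the current setup looks like this (a simplified sketch, not the real code; names are illustrative):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException


class ModelWrapper:
    """Stand-in for our per-model class: knows how to load itself and run inference."""

    def load(self) -> None: ...

    def infer(self, payload: dict) -> dict: ...


def build_active_models_from_config() -> dict:
    # Placeholder: in reality this reads the config file and instantiates
    # every model marked as active.
    return {}


MODEL_REGISTRY: dict = {}  # model id -> loaded model object


@asynccontextmanager
async def lifespan(app: FastAPI):
    # On startup, load every active model and register it by its unique id.
    for model_id, model in build_active_models_from_config().items():
        model.load()
        MODEL_REGISTRY[model_id] = model
    yield


app = FastAPI(lifespan=lifespan)


@app.post("/predict/{model_id}")
async def predict(model_id: str, payload: dict):
    model = MODEL_REGISTRY.get(model_id)
    if model is None:
        raise HTTPException(status_code=404, detail=f"unknown model id: {model_id}")
    return model.infer(payload)
```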
Now, in order to scale this, we want to convert the application to leverage Ray / Ray Serve.
What would be the best way to do this?
Do we convert all our model classes to serve.deployments?
We also have certain use cases where there are two individual sklearn models being run sequentially inside a single model object. We want to convert this into a parallel workflow using Ray DAGs.
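For concreteness, the sequential case looks roughly like this today (model names and paths are illustrative):

```python
import joblib


class TwoStageModel:
    """Simplified sketch: two independent sklearn models invoked one after the other."""

    def load(self) -> None:
        self.model_a = joblib.load("/models/model_a.joblib")
        self.model_b = joblib.load("/models/model_b.joblib")

    def infer(self, features):
        # The two predict() calls don't depend on each other, so we would like
        # Ray to run them in parallel instead of back to back.
        out_a = self.model_a.predict(features)
        out_b = self.model_b.predict(features)
        return {"model_a": out_a.tolist(), "model_b": out_b.tolist()}
```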
Kindly comment on the best way to perform this migration.
There are a couple of options:
- You could make each model its own standalone deployment. This may be ideal if you want to pack many replicas onto a few nodes but still run all of them simultaneously. (A rough sketch of both options follows this list.)
- You could use model multiplexing. Essentially, each deployment replica can load and run some of the models at any given time. As requests come in, Serve will first attempt to route them to replicas that already have their model loaded. This is a good option if you have many models that don’t all need to be running at the same time.
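A rough sketch of both options (the model names, loader, and replica count are placeholders; the multiplexing piece assumes a Ray version that ships serve.multiplexed):

```python
import joblib
from ray import serve
from starlette.requests import Request


def load_model(model_id: str):
    # Placeholder: however you load a fitted sklearn model today.
    return joblib.load(f"/models/{model_id}.joblib")


# Option 1: one standalone deployment per model, each with its own replicas.
@serve.deployment(num_replicas=2)
class ModelA:
    def __init__(self):
        self.model = load_model("model_a")

    async def __call__(self, request: Request):
        features = (await request.json())["features"]
        return {"prediction": self.model.predict([features]).tolist()}


# Option 2: a single deployment that multiplexes several models per replica.
@serve.deployment
class MultiplexedModels:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Loaded lazily and cached per replica; Serve prefers routing requests
        # to replicas that already hold the requested model.
        return load_model(model_id)

    async def __call__(self, request: Request):
        model_id = serve.get_multiplexed_model_id()  # taken from the request
        model = await self.get_model(model_id)
        features = (await request.json())["features"]
        return {"prediction": model.predict([features]).tolist()}


# e.g. serve.run(ModelA.bind(), name="model_a", route_prefix="/model_a")
```

With option 2, callers indicate which model they want via the serve_multiplexed_model_id request header.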
Hey Shreyas. Thanks for the reply.
Let me give you some more context. We are currently restricted to the head node, so we can only utilise Ray Serve in a single-node setting. The model code is written by data scientists and pushed to GitHub. We currently push the code to a Code artifact repository and download it at runtime using a poller. How can we do that in Ray Serve?
The endpoint we want per model should be of the following form: /predict/ModelName/ModelVersion
We need this endpoint created dynamically when the model inference pipeline is loaded into memory. Now, what do I mean by a model pipeline?
A model pipeline could be something like a RAG pipeline or an object detection model with certain post-processing steps. To standardize what our model pipelines should look like, we have provided clients with an abstract class that has methods like infer and post_process. An object of this class (with the model loaded) exposes the infer method and is used in production.
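Trimmed down, the contract we hand to data scientists looks something like this (the name/version attributes are just to show where /predict/ModelName/ModelVersion would come from):

```python
from abc import ABC, abstractmethod


class ModelPipeline(ABC):
    """Simplified version of the abstract class our data scientists implement."""

    name: str
    version: str

    @abstractmethod
    def load(self) -> None:
        """Load model weights / artifacts into memory."""

    @abstractmethod
    def infer(self, payload: dict):
        """Run the model(s) on the request payload."""

    @abstractmethod
    def post_process(self, raw_output) -> dict:
        """Turn raw model output into the response returned to callers."""
```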
How can we adopt a similar design and deploy it using Serve?
There are a few different design options here.
Updating the model code
- Ray Serve offers a reconfigure method to dynamically update deployment replicas. You could use the same poller to kick off a reconfigure call that loads the new code. Note that this only works if you don't need to change the deployment code itself. This is ideal for cases where you want to load new model weights but keep the surrounding code the same. (See the sketch after this list.)
- Otherwise, you could use the poller to kick off an upgrade. It would need to create or access a new Ray Serve config and either apply that to the current Ray cluster (to perform an in-place upgrade) or apply it to another Ray cluster and shift traffic.
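A minimal sketch of the reconfigure approach, assuming the model weights live at a path stored in user_config (the config keys and loader are illustrative):

```python
import joblib
from ray import serve


@serve.deployment(user_config={"model_path": "/models/fraud-1.0.0.joblib"})
class ModelPipelineDeployment:
    def __init__(self):
        self.model = None

    def reconfigure(self, config: dict):
        # Called once when the deployment starts and again every time
        # user_config changes in the Serve config, without restarting replicas.
        self.model = joblib.load(config["model_path"])

    async def __call__(self, request):
        features = (await request.json())["features"]
        return {"prediction": self.model.predict([features]).tolist()}
```

The poller then only has to update user_config (e.g. point it at a new weights file) in the Serve config and re-apply it; Serve calls reconfigure on the running replicas without tearing them down.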
Dynamic Endpoints
If you want to dynamically change your Serve app’s routes, then you would need to perform either an in-place or cross-cluster upgrade. The poller would need to create a new config and then perform the upgrade.
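For example, the poller could regenerate a multi-application Serve config with one route_prefix per model pipeline and re-apply it in place. A rough sketch, assuming the multi-app config format and the serve deploy CLI; the model list and import paths are placeholders:

```python
import subprocess

import yaml  # PyYAML


def build_serve_config(active_models: list[dict]) -> dict:
    # active_models might look like:
    # [{"name": "FraudModel", "version": "1.2.0", "import_path": "pipelines.fraud:app"}]
    return {
        "applications": [
            {
                "name": f"{m['name']}-{m['version']}",
                "route_prefix": f"/predict/{m['name']}/{m['version']}",
                "import_path": m["import_path"],
            }
            for m in active_models
        ]
    }


def apply_in_place(active_models: list[dict], path: str = "serve_config.yaml") -> None:
    with open(path, "w") as f:
        yaml.safe_dump(build_serve_config(active_models), f)
    # In-place upgrade of the Serve instance running on the current cluster.
    subprocess.run(["serve", "deploy", path], check=True)
```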