Design Help: Convert FastAPI application to Ray Serve

Currently I am using FastAPI to serve models.
The architecture is pretty simple.
When the FastAPI app starts up, it creates model objects by looking for all models marked as active in a config file.
Each model object is basically a class containing methods for loading the model and performing inference.

The model objects are stored in a map keyed by a string (a unique model identifier).
During inference, the model ID is passed in to look up the model object in the map, and the corresponding infer method is called.
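
For reference, here is a minimal sketch of that setup (`ModelWrapper`, `MODEL_REGISTRY`, and `models.yaml` are placeholder names; the real classes and config differ):

```python
# Hypothetical sketch of the current FastAPI setup described above.
import yaml
from fastapi import FastAPI, HTTPException

class ModelWrapper:
    """Wraps one model: loads it once, then serves inference calls."""
    def __init__(self, model_path: str):
        self.model = self.load(model_path)

    def load(self, model_path: str):
        ...  # e.g. joblib.load(model_path)

    def infer(self, payload: dict):
        ...  # e.g. return self.model.predict(...)

app = FastAPI()
MODEL_REGISTRY: dict[str, ModelWrapper] = {}

@app.on_event("startup")
def load_active_models():
    # Build the id -> model map from models marked active in the config.
    config = yaml.safe_load(open("models.yaml"))
    for entry in config["models"]:
        if entry["active"]:
            MODEL_REGISTRY[entry["id"]] = ModelWrapper(entry["path"])

@app.post("/infer/{model_id}")
def infer(model_id: str, payload: dict):
    model = MODEL_REGISTRY.get(model_id)
    if model is None:
        raise HTTPException(status_code=404, detail="unknown model id")
    return model.infer(payload)
```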

Now, in order to scale this, we want to convert the application to leverage Ray/Ray Serve.

What would be the best way to do this?
Do we convert all our model classes to serve.deployments?

We also have certain use cases where there are two individual sklearn models run sequentially inside a single model object. We want to convert this into a parallel workflow using Ray DAGs; the sketch below is roughly what we have in mind.
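
Something like the following (`run_model_a`, `run_model_b`, and `combine` are placeholder names, and the actual sklearn calls are elided):

```python
# Hypothetical sketch: fan two sklearn models out in parallel with a Ray DAG.
import ray
from ray.dag import InputNode

@ray.remote
def run_model_a(features):
    ...  # e.g. return model_a.predict(features)

@ray.remote
def run_model_b(features):
    ...  # e.g. return model_b.predict(features)

@ray.remote
def combine(pred_a, pred_b):
    ...  # merge the two predictions into one response

# Both models consume the same input node, so Ray can run them in parallel;
# combine() joins their outputs.
with InputNode() as features:
    dag = combine.bind(run_model_a.bind(features), run_model_b.bind(features))

result = ray.get(dag.execute({"feature": 1.0}))  # example request payload
```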

Kindly comment on the best way to perform this migration.

There are a couple of options:

  1. You could make each model its own standalone deployment (see the first sketch below). This may be ideal if you want to pack many replicas onto a few nodes but still run all of them simultaneously.
  2. You could use model multiplexing (see the second sketch below). Essentially, each deployment replica can load and run some of the models at any given time. As requests come in, Serve will first attempt to route them to replicas that already have their model loaded. This is a good option if you have many models that don’t all need to be running at the same time.
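
For option 1, a minimal sketch with two hypothetical models: each `@serve.deployment` class gets its own replicas and scaling settings, and a small router deployment keeps your existing "id -> model" dispatch, just with deployment handles as the values.

```python
# Hypothetical sketch of option 1: one Serve deployment per model.
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment(num_replicas=2)
class ModelA:
    def __init__(self):
        self.model = ...  # load model A here, e.g. joblib.load(...)

    def __call__(self, payload: dict):
        ...  # e.g. return self.model.predict(...)

@serve.deployment(num_replicas=2)
class ModelB:
    def __init__(self):
        self.model = ...  # load model B here

    def __call__(self, payload: dict):
        ...

@serve.deployment
class Router:
    def __init__(self, model_a: DeploymentHandle, model_b: DeploymentHandle):
        # Same "id -> model" map as before, but the values are handles.
        self.handles = {"model_a": model_a, "model_b": model_b}

    async def __call__(self, request):
        body = await request.json()
        handle = self.handles[body["model_id"]]
        return await handle.remote(body["payload"])

app = Router.bind(ModelA.bind(), ModelB.bind())
# serve.run(app)  # then POST {"model_id": ..., "payload": ...} to the app
```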
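For option 2, a sketch using Serve's multiplexing API (`load_model_from_config` is a hypothetical stand-in for your config-driven loader). Clients indicate the target model via the `serve_multiplexed_model_id` request header, and Serve routes to a replica that already has it loaded when possible.

```python
# Hypothetical sketch of option 2: one deployment multiplexing many models.
from ray import serve

def load_model_from_config(model_id: str):
    ...  # hypothetical: look up model_id in the config and load the model

@serve.deployment
class ModelMultiplexer:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Loaded on demand; Serve caches up to max_num_models_per_replica
        # models per replica and evicts least-recently-used ones.
        return load_model_from_config(model_id)

    async def __call__(self, request):
        # Model id comes from the serve_multiplexed_model_id request header.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model.infer(await request.json())

app = ModelMultiplexer.bind()
# serve.run(app)
```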