Context
Hello, we are quite new to Ray Serve and are interested in the FastAPI integration. We would appreciate some clarification regarding the recommended approach.
Let's say we have two entry points per use case (one via HTTP for online use, one via a Python function for offline use). Each has its own decoding and encoding logic, but the business logic of calling the models is shared. We could have something like this:
Approach 1
Web endpoint
@router.post("/predict")
async def predict_online(image: UploadFile = File(...)):
    # decode online request format
    # call business logic
    # call model A using ServeHandle
    # call model B using ServeHandle
    # ...
    # call model N using ServeHandle
    # return web response (eg: json)
    ...
Offline entrypoint (python function)
async def predict_offline(image: bytes):
    # decode offline request format
    # call business logic
    # call model A using ServeHandle
    # call model B using ServeHandle
    # ...
    # call model N using ServeHandle
    # return response formatted for offline use
    ...
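For concreteness, a rough sketch of what the shared "call business logic" step could look like with sync ServeHandles (the function name and the way handles are passed in are our assumptions, not anything prescribed by Ray Serve):

import ray

# Hypothetical shared business logic used by both entry points.
# With a sync ServeHandle, handle.remote() returns an ObjectRef
# that is resolved with ray.get().
def call_models(decoded_image, model_handles):
    refs = [handle.remote(decoded_image) for handle in model_handles]
    # Block until every model has returned its prediction.
    return ray.get(refs)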
Models
Now for the models, our intuition is that we could have one @serve.deployment-decorated class per model, e.g.:
@serve.deployment(ray_actor_options={"num_gpus": 0.7})
class ModelADeployment:
...
async def __call__(self, inputs):
return self.model(inputs)
(The same principle applies to all N models; each model could have its own GPU requirements, e.g. 0.7 GPU for model A, 0.3 for model B, etc.) Let's imagine these models are deployed in some main or startup function of the application.
Then we could call our models through Python ServeHandles in the shared business logic (the "call business logic" part from earlier), the same way from both entry points, since the model calls happen in that shared code.
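To illustrate, a minimal sketch of such a startup function under the Ray Serve 1.x deployment API that .deploy() implies (the deployment names and detached mode are our assumptions):

from ray import serve

def deploy_models():
    # Start Serve in detached mode so the deployments outlive this script.
    serve.start(detached=True)
    # Each model is its own deployment with its own resource requirements
    # (taken from its @serve.deployment decorator).
    ModelADeployment.deploy()
    ModelBDeployment.deploy()
    # Handles let the shared business logic call each model independently.
    return ModelADeployment.get_handle(), ModelBDeployment.get_handle()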
Deployment
For deployment, we could call .deploy() on the appropriate model when a new model gets pushed to the model store (e.g. if a new version of Model C is published, we can update the version of ModelCDeployment via .options(version=new_version_number) and then call ModelCDeployment.deploy()).
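Roughly, the update hook we have in mind would look like this (new_version coming from the model store is our assumption):

def redeploy_model_c(new_version: str):
    # Only ModelCDeployment is updated; the other model deployments keep running.
    ModelCDeployment.options(version=new_version).deploy()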
This Approach 1 was what we had in mind after looking at the Model Composition doc, but we weren't sure whether it is what is recommended after reading the documentation for the FastAPI integration (e.g. whether we should use the ingress decorator).
We would like to know if it is in line with Ray Serve recommendations; let's call this Approach 1 for the sake of this discussion (so no ingress decorator for this approach).
Approach 2
Let's consider another approach that makes use of the ingress decorator. It is identical for the deployment of new model versions (using .deploy() on the specific model that changed) but differs in where the entry points reside.
We can see that the documentation recommends the use of @serve.ingress(app), and that the ingress class also has its own @serve.deployment decorator.
Example from the documentation:
@serve.deployment(route_prefix="/hello")
@serve.ingress(app)
class MyFastAPIDeployment:
@app.get("/")
def root(self):
return "Hello, world!"
Excuse our possibly naïve understanding, but we were wondering: what is the point of having a dedicated @serve.deployment for the endpoint (here MyFastAPIDeployment), since in some cases one endpoint uses multiple models and each model scales differently (with different GPU requirements in its own @serve.deployment)?
Does having @serve.deployment on MyFastAPIDeployment mean we can no longer scale each model used inside MyFastAPIDeployment independently, that we would always need to go through MyFastAPIDeployment to deploy a new model (e.g. MyFastAPIDeployment.deploy()), and that we would need to merge all resource requirements (e.g. Model A 0.7 GPU + Model B 0.3 GPU = 1.0 GPU) into the MyFastAPIDeployment @serve.deployment decorator? Or is it still fine to have one deployment class per model as well and simply call them in the following way:
@serve.deployment(route_prefix="/road-classification")
@serve.ingress(router)
class RoadClassificationDeployment:
    @router.post("/predict")
    async def predict_online(self, image: UploadFile = File(...)):
        # decode online request format
        # call business logic
        # call model A using ServeHandle
        # call model B using ServeHandle
        # ...
        # call model N using ServeHandle
        # return web response (eg: json)
        ...

    async def predict_offline(self, image: bytes):
        # decode offline request format
        # call business logic
        # call model A using ServeHandle
        # call model B using ServeHandle
        # ...
        # call model N using ServeHandle
        # return response formatted for offline use
        ...
And when another use case comes in, we could create another class decorated with ingress, for instance:
@serve.deployment(route_prefix="/landing-spot")
@serve.ingress(router)
class LandingSpotSegmentationDeployment:
    @router.post("/predict")
    async def predict_online(self, image: UploadFile = File(...)):
        # decode online request format
        # call business logic
        # call model 1 using ServeHandle
        # call model 2 using ServeHandle
        # return web response (eg: json)
        ...

    async def predict_offline(self, image: bytes):
        # decode offline request format
        # call business logic
        # call model 1 using ServeHandle
        # call model 2 using ServeHandle
        # return response formatted for offline use
        ...
So the only change is that we moved the two entry points inside a class decorated with ingress. This way the offline code can still get a ServeHandle to call predict_offline, and HTTP requests would be routed to predict_online.
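For example, the offline caller could look roughly like this (the deployment name and the exact handle API are our assumptions and may differ by Ray Serve version):

import ray
from ray import serve

def classify_offline(image_bytes: bytes):
    # Get a handle to the ingress deployment by its name (the class name by default).
    handle = serve.get_deployment("RoadClassificationDeployment").get_handle()
    # Target the predict_offline method instead of the HTTP entry point.
    ref = handle.options(method_name="predict_offline").remote(image_bytes)
    return ray.get(ref)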
Conclusion
Please let us know whether Approach 1 or Approach 2 is more in line with what is recommended, or if both are completely wrong.
Lastly, is the @serve.deployment decorator on MyFastAPIDeployment simply there to provide a route_prefix, and does it not really change anything with regard to resource requirements since no options are specified?
Thanks