Ray Serve FastAPI Recommended Approach

Context

Hello, we are quite new to Ray Serve but we are interested in the FastAPI integration, and we would like some clarification regarding the recommended approach.

Let’s say we have 2 entry points per use case (1 via HTTP (online), 1 via a Python function (offline)); each has its own decoding and encoding logic, but the business logic of calling the models etc. is the same (it’s shared). We could have something like this:

Approach 1

Web endpoint

@router.post("/predict")
async def predict_online(image: UploadFile = File(...)):
    # decode online request format
    # call business logic
        # call model A using ServeHandle
        # call model B using ServeHandle
        # ... 
        # call model N using ServeHandle
    # return web response (eg: json)

Offline entrypoint (python function)

async def predict_offline(image: bytes):
    # decode offline request format
    # call business logic
        # Call model A using ServeHandle
        # Call model B using ServeHandle
        # ... 
        # Call model N using ServeHandle
    # return response formatted for offline use

Models

Now, for the models, our intuition is that we could have one @serve.deployment-decorated class per model, eg:

@serve.deployment(ray_actor_options={"num_gpus": 0.7})
class ModelADeployment:
    ...
    async def __call__(self, inputs):
        return self.model(inputs)

(same principle for all N models; each model could have its own GPU requirements, eg: 0.7 for model A, 0.3 for model B, etc.). Let’s imagine these models are deployed in some main or startup function of the application.
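For illustration, a second model deployment with a different GPU fraction plus a startup helper might look roughly like this (the class name, the loader, and the version strings are just our assumptions):

from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 0.3})
class ModelBDeployment:
    def __init__(self):
        self.model = load_model_b()  # placeholder loader for model B

    async def __call__(self, inputs):
        return self.model(inputs)

def deploy_models():
    # assumes ray.init() / serve.start() have already run elsewhere during startup
    ModelADeployment.options(version="1").deploy()
    ModelBDeployment.options(version="1").deploy()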

Then, inside both entry points, we could call our models from the shared business logic (the “call business logic” part from earlier) using Python ServeHandles, since the model calls happen in that shared code.
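Roughly, the shared business-logic helper could look like this (the helper name is ours; this assumes the async ServeHandle API, where awaiting .remote() returns an ObjectRef that is then awaited for the result):

async def run_models(inputs):
    # handles to the already-deployed model deployments
    model_a = ModelADeployment.get_handle(sync=False)
    model_b = ModelBDeployment.get_handle(sync=False)

    # with async handles: await .remote() -> ObjectRef, await the ref -> result
    ref_a = await model_a.remote(inputs)
    ref_b = await model_b.remote(inputs)
    return await ref_a, await ref_b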

Deployment

For deployment, we could call .deploy() on the appropriate model whenever a new model gets pushed to the model store (eg: if a new version of Model C is published, we can update the version of ModelCDeployment via .options(version=new_version_number) and then call .deploy() on it).
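In code, our mental model of that update step is something like this (the hook name is hypothetical):

def on_new_model_version(new_version_number: str):
    # .options() returns an updated copy of the deployment; .deploy() rolls it out
    ModelCDeployment.options(version=new_version_number).deploy()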

This Approach 1 was what we had in mind after looking at the Model Composition doc, but we weren’t sure it is what is recommended after reading the documentation for the FastAPI integration (eg: that we should use the ingress decorator).

We would like to know whether it is in line with Ray Serve recommendations; let’s call this Approach 1 for the sake of this discussion (so no ingress decorator for this approach).

Approach 2

Let’s consider another approach that makes use of the ingress decorator. This approach is identical for the deployment of the new model versions (using .deploy() on the specific model that changed) but is different in where the entry points reside.

We can see that the documentation recommends the use of @serve.ingress(app), and we can also see that it has its own @serve.deployment annotation.

Example from the documentation:

@serve.deployment(route_prefix="/hello")
@serve.ingress(app)
class MyFastAPIDeployment:
    @app.get("/")
    def root(self):
        return "Hello, world!"

Excuse our possibly naïve understanding, but we were wondering: what is the point of having a dedicated @serve.deployment for the endpoint (here MyFastAPIDeployment), since in some cases one endpoint can use multiple models where each model would scale differently (having different GPU requirements in its own @serve.deployment)?

Does having @serve.deployment on MyFastAPIDeployment mean that we can no longer scale each model used inside MyFastAPIDeployment independently, that we would always need to go through MyFastAPIDeployment to deploy a new model (eg: MyFastAPIDeployment.deploy()), and that we would need to merge all resource requirements (eg: ModelA 0.7 GPU + ModelB 0.3 GPU = 1.0 GPU) into the MyFastAPIDeployment @serve.deployment annotation? Or is it still fine to have one deployment class per model as well and simply call them in the following way:

@serve.deployment(route_prefix="/road-classification")
@serve.ingress(router)
class RoadClassificationDeployment:

    @router.post("/predict")
    async def predict_online(self, image: UploadFile = File(...)):
        # decode online request format
        # call business logic
            # call model A using ServeHandle
            # call model B using ServeHandle
            # ...
            # call model N using ServeHandle
        # return web response (eg: json)

    async def predict_offline(self, image: bytes):
        # decode offline request format
        # call business logic
            # call model A using ServeHandle
            # call model B using ServeHandle
            # ...
            # call model N using ServeHandle
        # return response formatted for offline use

And when another use case comes in we could create another class decorated with ingress, for instance:

@serve.deployment(route_prefix="/landing-spot")
@serve.ingress(router)
class LandingSpotSegmentationDeployment:

    @router.post("/predict")
    async def predict_online(self, image: UploadFile = File(...)):
        # decode online request format
        # call business logic
            # call model 1 using ServeHandle
            # call model 2 using ServeHandle
        # return web response (eg: json)

    async def predict_offline(self, image: bytes):
        # decode offline request format
        # call business logic
            # call model 1 using ServeHandle
            # call model 2 using ServeHandle
        # return response formatted for offline use

So the only change is that we moved the 2 entry points inside a class annotated with ingress. This way the offline code can still get a ServeHandle to call predict_offline, and HTTP requests would be routed to predict_online.

Conclusion

Please let us know whether Approach 1 or Approach 2 is more in line with what is recommended, or if both are completely wrong.

Lastly, is the @serve.deployment decorator on MyFastAPIDeployment simply there to provide a route_prefix, and does it not really change anything with regard to resource requirements since no options are specified?

Thanks

To summarize my understanding of your question:

  • Approach 1 only uses ServeHandles and calls the model deployments from a web endpoint or a Python function.
  • Approach 2 consolidates the business logic, the ServeHandle calls, and the response marshalling into a wrapper Serve deployment.

@serve.ingress(fastapi_app) is recommended when:

  • You want to bring an existing FastAPI app to be managed and scaled out by Serve.
  • You want the HTTP features from FastAPI when processing web requests with Serve (without it you work directly with the lower-level starlette.Request; see the short sketch after this list).
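For context, without the ingress decorator a deployment handles HTTP roughly like this minimal sketch, parsing the raw request itself (class and route names are ours):

from starlette.requests import Request
from ray import serve

@serve.deployment(route_prefix="/predict")
class RawHTTPDeployment:
    async def __call__(self, request: Request):
        # no FastAPI here: you parse and validate the request body yourself
        payload = await request.body()
        return {"received_bytes": len(payload)}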

For your use case, I would recommend an architecture similar to Approach 2. Your RoadClassificationDeployment mostly just orchestrates the requests and does not occupy any compute. You can also abstract the shared code out into a class method.

@serve.deployment(route_prefix="/road-classification")
@serve.ingress(router)
class RoadClassificationDeployment:

    @router.post("/predict")
    async def predict_online(self, image: UploadFile = File(...)):
        # decode online request format
        self.common()
        # return web response (eg: json)

    async def predict_offline(self, image: bytes):
        # decode offline request format
        self.common()
        # return response formatted for offline use

    def common(self):
        # call business logic
        # call model A using ServeHandle
        # call model B using ServeHandle
        # ...
        # call model N using ServeHandle
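To make common() concrete, it could obtain the model handles once in __init__ and then await them, roughly like this (our sketch, assuming the async ServeHandle API; note that common() then becomes a coroutine the entry points would await):

class RoadClassificationDeployment:  # same class as above, only the relevant pieces shown

    def __init__(self):
        # cache handles to the model deployments once per replica
        self.model_a = ModelADeployment.get_handle(sync=False)
        self.model_b = ModelBDeployment.get_handle(sync=False)

    async def common(self, inputs):
        # await .remote() -> ObjectRef, await the ref -> result
        ref_a = await self.model_a.remote(inputs)
        ref_b = await self.model_b.remote(inputs)
        return await ref_a, await ref_b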

This approach has several advantages compared to Approach 1:

  • Easier to share code and reason about the online and offline code paths.
  • The FastAPI app is now managed by Serve: you can upgrade it and you don’t need to run a separate uvicorn my_app:app in addition to Ray Serve.
  • When you are calling it offline, instead of directly calling the model handles, you will now be calling this wrapper deployment that orchestrates things within the Ray cluster. Overall it reduces the communication between your offline client and the Ray Serve cluster.
handle = RoadClassificationDeployment.get_handle()
# every handle call goes through .remote(); dispatch to a specific method via method_name
refs = [handle.options(method_name="predict_offline").remote(row) for row in dataset]
ray.get(refs)

Lastly, if you want to expand this to the LandingSpotSegmentation use case, you can either add it directly to the same class or use OOP techniques such as inheriting from a shared base class.

Addressing some of your questions

Does having @serve.deployment on MyFastAPIDeployment mean that we can no longer scale each model used inside MyFastAPIDeployment independently, that we would always need to go through MyFastAPIDeployment to deploy a new model (eg: MyFastAPIDeployment.deploy()), and that we would need to merge all resource requirements (eg: ModelA 0.7 GPU + ModelB 0.3 GPU = 1.0 GPU) into the MyFastAPIDeployment @serve.deployment annotation? Or is it still fine to have one deployment class per model as well and simply call them in the following way…

Because MyFastAPIDeployment and your model deployments are different deployments, you specify their resource requirements and deploy them separately. The only special thing is that MyFastAPIDeployment calls the model deployments via ServeHandle. You don’t need to merge resource requirements at all.
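Concretely, the resource split could stay exactly as in your example, with the wrapper deployment declaring no GPU at all (sketch only):

@serve.deployment(ray_actor_options={"num_gpus": 0.7})
class ModelADeployment: ...

@serve.deployment(ray_actor_options={"num_gpus": 0.3})
class ModelBDeployment: ...

# the ingress deployment only orchestrates, so no GPU is requested for it
@serve.deployment(route_prefix="/road-classification")
@serve.ingress(router)
class RoadClassificationDeployment: ...

# each deployment is deployed (and later upgraded) independently
ModelADeployment.deploy()
ModelBDeployment.deploy()
RoadClassificationDeployment.deploy()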

Lastly, is the @serve.deployment decorator on MyFastAPIDeployment simply there to provide a route_prefix, and does it not really change anything with regard to resource requirements since no options are specified?

Correct; in order to change the resource requirements, you would have to specify them inside @serve.deployment(ray_actor_options={"num_cpus": 0.5}). The route_prefix is only used for routing purposes.
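For example, if the wrapper deployment ever needed its own resources or more replicas, you could declare that directly on it (values are illustrative):

@serve.deployment(
    route_prefix="/road-classification",
    num_replicas=2,                       # illustrative
    ray_actor_options={"num_cpus": 0.5},  # illustrative
)
@serve.ingress(router)
class RoadClassificationDeployment: ...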