Help designing fire and forget server for large batch inference

jonaz · November 27, 2023, 2:25pm

How severe does this issue affect your experience of using Ray?
Not really an issue, more of a design question.

Hello all! I’m pretty new to the Ray ecosystem, so I thought I’d get some guidelines here before building something and messing up.

Use case
I want to build a server that receives batches of items and runs inference on them with a large model. Inference can therefore take a very long time (between minutes and hours), so I don’t want to leave the client waiting for a response. Instead, the server should receive the request, immediately send a reply (hence the “fire and forget”), and queue the work. The inference results will be stored in some DB afterwards.

This is a draft of what I have so far:

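from typing import Dict

from ray import serve
from starlette.requests import Request


@serve.deployment
class InferenceDeployment:
    def __init__(self):
        # load_large_model() is a placeholder for the real model loader.
        self.model = load_large_model()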

    async def __call__(self, items) -> None:
        print(f"Starting inference of {len(items)} items.")
        predictions = self.model.predict(items)
        print(f"Done inference: {predictions=}")
        # Store results in DB
        return None


@serve.deployment
class IngressDeployment:
    def __init__(self, inference_handle):
        self.inference_handle = inference_handle.options(use_new_handle_api=True)

    async def __call__(self, http_request: Request) -> Dict:
        items = await http_request.json()
        self.inference_handle.remote(items)
        return {"success": True}


inference_deployment = InferenceDeployment.bind()
app = IngressDeployment.bind(inference_deployment)

A couple of questions about this initial design:

1. I figured that by calling deployment_handle.remote(items) without awaiting its result, Ray should queue the work and not block the ingress deployment.
   Is this the case, or is there a risk of “losing” work by not retrieving its result, for example?
2. The ingress deployment will process requests a lot faster than the inference one, and most likely I won’t have enough inference replicas to process everything at the same time. So if I understand correctly, each call to deployment_handle.remote(items) will add the work to the deployment queue.
   But how large is this queue? Is there a way of at least ensuring that the work was successfully added to that queue? And what would happen to that queue if the application crashes, for example? How can I make it resilient?
   I need to guarantee that each request ACKed by the ingress deployment will eventually get processed. It doesn’t matter when, but I can’t lose jobs.

Hope these questions make sense, and thanks for reading this far!

jonaz · November 28, 2023, 10:23am

Well, after posting this I did some digging and found this thread: Job Queue in Ray? - #24 by ericl

Which led me to these RFCs:

[RFC] Job queueing functionality with Ray Serve + Workflows · Issue #21161 · ray-project/ray · GitHub
[RFC] Async request support in Ray Serve · Issue #32292 · ray-project/ray · GitHub

And to discovering the existence of Ray Workflows, which I believe could cover my use case.

But anyway, I’d appreciate guidelines on how to use Workflows to design something like that, if anybody has some tips or best practices. I also know it’s still in Alpha.

Excited to see the development of the Async requests on Serve as well!
One suggestion: make Ray Workflows more visible. It took me quite a while to find out it existed, and I think use cases like mine are becoming more and more common (as the second RFC suggests).

jonaz · November 29, 2023, 5:51pm

I’ve been trying to adapt my service to use Ray Workflows instead, but I’m running into some issues: I’m getting OOMs because memory is not being released by Ray::IDLE processes (I posted the issue here). Besides that, each invocation of the workflow needs to load the model again from scratch, which is quite wasteful, since I always need the same model.

So, I’m still interested in my original questions from the first post:

Is there a risk of “losing” work by doing deployment.remote() but not awaiting its result for example?
How large is the internal queue receiving requests when doing deployment.remote()? Is there a risk of it dropping requests?

Jules_Damji · November 29, 2023, 6:05pm

@jonaz Just be mindful that Ray Workflows is very experimental, so I would avoid using it for now. Instead, try async batch inference first.

cc: @Gene @Sihan_Wang @Akshay_Malik FYI

jonaz · November 29, 2023, 6:18pm

try with Async batch inference first.

By this, do you mean the approach I proposed in my first post above? Or is there some built-in Ray functionality for this?

eoakes · November 29, 2023, 7:33pm

@jonaz thanks for the detailed description and questions. As you can see from the RFC links above, we’ve also seen interest from others in the community on this topic.

As it stands, we don’t have very mature support for this type of asynchronous/queuing workflow in Ray. As you’ve discovered, there are roughly two patterns for doing this entirely in Ray:

1. Responding immediately after queuing requests in the Ray Serve DeploymentHandle. To answer your question from above, there is no automatic persistence/fault tolerance here. So if the process that’s queueing the request fails for any reason, the request will be lost. It sounds like this is not acceptable for your use case.
2. Using Ray Workflows. To avoid the issue you described above, you could integrate this with Ray Serve. That is, instead of directly performing the inference in-process, you could make a call to a Ray Serve deployment using the DeploymentHandle from the workflow step.
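
As a rough analogy for the first pattern, here is a minimal sketch in plain asyncio. It is not Ray Serve code and does not reflect Serve’s internals, and the names `worker`, `handle_request`, and `simulate` are hypothetical. Requests are ACKed as soon as they are placed on an in-process queue, and whatever has not finished when the process stops is simply gone.

```python
import asyncio


async def worker(pending: asyncio.Queue, done: list) -> None:
    # Simulates a slow inference replica draining the in-memory queue.
    while True:
        items = await pending.get()
        await asyncio.sleep(0.05)  # stand-in for a long model.predict(items)
        done.append(items)


async def handle_request(pending: asyncio.Queue, items: list) -> dict:
    # Queue the work and ACK right away, without waiting for the result.
    pending.put_nowait(items)
    return {"success": True}


async def simulate(n_requests: int, uptime_s: float) -> tuple:
    pending: asyncio.Queue = asyncio.Queue()
    done: list = []
    task = asyncio.create_task(worker(pending, done))
    acked = [await handle_request(pending, [i]) for i in range(n_requests)]
    await asyncio.sleep(uptime_s)  # the process stays up for a while...
    task.cancel()                  # ...and then "crashes"
    try:
        await task
    except asyncio.CancelledError:
        pass
    # Anything still queued or in flight lived only in memory and is lost.
    return len(acked), len(done), pending.qsize()


if __name__ == "__main__":
    acked, done, still_queued = asyncio.run(simulate(5, 0.12))
    print(f"ACKed={acked} processed={done} still queued at crash={still_queued}")
```

Every request was acknowledged, but only some were processed before the simulated crash, which is why this pattern alone cannot give an “ACKed means eventually processed” guarantee.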

If you have strong persistence requirements, my recommendation for the most production-ready setup you could have is to integrate with an external queueing system (e.g., AWS SQS). One way to structure this is with multiple Ray Serve applications – one that accepts requests and places them on the external queue, then another that repeatedly polls for work from the queue and performs inference on it (perhaps using a downstream DeploymentHandle call like you have it currently set up).
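
To make the ordering in that setup concrete, here is a minimal sketch using only the standard library. A SQLite table stands in for the external queue (it is not SQS and involves no Ray API), and `open_queue`, `enqueue`, `claim`, and `ack` are hypothetical helper names. The key idea is that the ingress side persists the job before replying, and the worker deletes it only after inference and the result write succeed.

```python
import json
import sqlite3
import time


def open_queue(path: str) -> sqlite3.Connection:
    # SQLite is only a stand-in for a durable external queue.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "visible_at REAL NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, items: list) -> int:
    # Ingress side: commit the job durably BEFORE ACKing the client.
    cur = conn.execute(
        "INSERT INTO jobs (payload) VALUES (?)", (json.dumps(items),)
    )
    conn.commit()
    return cur.lastrowid


def claim(conn: sqlite3.Connection, timeout_s: float = 3600.0):
    # Worker side: take the oldest visible job and hide it for timeout_s.
    # If the worker dies before ack(), the job becomes visible again.
    # Note: this select-then-update is a single-worker simplification.
    now = time.time()
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE visible_at <= ? "
        "ORDER BY id LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE jobs SET visible_at = ? WHERE id = ?",
        (now + timeout_s, row[0]),
    )
    conn.commit()
    return row[0], json.loads(row[1])


def ack(conn: sqlite3.Connection, job_id: int) -> None:
    # Delete only after inference finished and results were stored.
    conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
    conn.commit()
```

In a setup like the one described, the ingress application would call something like `enqueue` before returning `{"success": True}`, and the polling application would loop over `claim`, run inference, store the results, and then `ack`. Because a job can be redelivered after a crash, processing is at-least-once, so the result write should be idempotent.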

Hopefully this helps to frame the possible solutions & tradeoffs. Let me know if you have any questions. I can also try to find a code sample for the external queueing solution if you are interested in it.

Jules_Damji · November 29, 2023, 10:00pm

Thanks @eoakes for lending your suggestions.

jonaz · November 30, 2023, 10:00am

Thanks @eoakes for your detailed answers!

Regarding this, I think I actually tried that, but with no success. For some reason, I didn’t manage to make a deployment_handle.remote() call from inside a Workflow task/step; I’m probably using the API wrong at some point. I actually made another thread about that here: Workflow calling Deployment.remote()? Maybe you could check it out? I’d really appreciate it.
