Help designing fire and forget server for large batch inference

jonaz · November 29, 2023, 5:51pm

I’ve been trying to adapt my service to use Ray Workflows instead, but I’m running into some issues: I’m getting OOMs because memory is not being released by ray::IDLE processes (I posted the issue here). Besides that, each invocation of the workflow still has to load the model again from scratch, which is quite wasteful since I always need the same model.

So, I’m still interested in my original questions from the first post:

Is there a risk of “losing” work if I call deployment.remote() but never await its result, for example?
How large is the internal queue receiving requests when doing deployment.remote()? Is there a risk of it dropping requests?
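
To make the two questions concrete, here is a minimal sketch of the behaviour I am asking about. It uses plain asyncio as a stand-in and is not Ray Serve code: BatchWorker, submit, and max_queued are hypothetical names, and the bounded queue that drops requests when full is an assumption for illustration, not a claim about how Ray Serve actually queues calls to deployment.remote().

```python
import asyncio


class BatchWorker:
    """Hypothetical stand-in for a long-lived replica (not a Ray API)."""

    def __init__(self, max_queued: int):
        self.model_loads = 0
        # Assumption: the request queue is bounded; Ray's real limits may differ.
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_queued)
        self.results = []
        self.dropped = 0
        self._load_model()  # loaded once, reused for every request

    def _load_model(self):
        self.model_loads += 1
        # Dummy "model": doubles every element of the batch.
        self.model = lambda batch: [x * 2 for x in batch]

    def submit(self, batch) -> bool:
        """Fire and forget: enqueue and return without waiting for a result."""
        try:
            self.queue.put_nowait(batch)
            return True
        except asyncio.QueueFull:
            # This is the failure mode the second question is about.
            self.dropped += 1
            return False

    async def run(self):
        """Consume queued batches forever, reusing the already loaded model."""
        while True:
            batch = await self.queue.get()
            self.results.append(self.model(batch))
            self.queue.task_done()


async def demo():
    worker = BatchWorker(max_queued=2)
    # Submit four batches before the consumer starts; only two fit in the queue.
    accepted = [worker.submit([i, i + 1]) for i in range(4)]
    consumer = asyncio.create_task(worker.run())
    await worker.queue.join()  # wait until every accepted batch is processed
    consumer.cancel()
    return worker, accepted
```

In this toy version the caller at least sees a False return when a request is dropped. What I would like to know is whether Ray Serve has an equivalent limit and, if so, how it signals that a request was not accepted.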
