Help designing a fire-and-forget server for large batch inference

I’ve been trying to adapt my service to use Ray Workflow instead, but I’m running into some issues – I’m getting OOMs because memory is not released by the ray::IDLE processes (I posted the issue here). Besides that, I still have the problem that each invocation of the workflow needs to load the model again from scratch, which is quite wasteful since I always use the same model.
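
To make the reload issue concrete, here is a minimal sketch of the shape of my workflow invocation. The names (`run_batch`, the storage path, the stand-in model) are just placeholders, and I’m assuming the Ray 2.x workflow API where a `.bind()` DAG is passed to `workflow.run`:

```python
import ray
from ray import workflow


@ray.remote
def run_batch(batch):
    # Stand-in for the real model load; in my service this is the expensive
    # part, and it runs again inside every workflow invocation.
    model = lambda xs: [x * 2 for x in xs]
    return model(batch)


ray.init(storage="/tmp/workflow_storage")  # placeholder storage location

# Every run pays the full model-load cost again inside run_batch.
result = workflow.run(run_batch.bind(list(range(8))), workflow_id="batch-0")
```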

So, I’m still interested in my original questions from the first post:

  1. Is there a risk of “losing” work if, for example, I call deployment.remote() but never await its result?

  2. How large is the internal queue that receives requests submitted via deployment.remote()? Is there a risk of it dropping requests? (A sketch of the call pattern I mean follows below.)
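
For reference, this is roughly the fire-and-forget pattern both questions are about, as a minimal sketch. The deployment body is a placeholder, and I’m assuming a handle obtained from serve.run:

```python
import ray
from ray import serve


@serve.deployment
class BatchInference:
    def __init__(self):
        # Placeholder for the real model load; with Serve this happens once
        # per replica rather than once per request, which is what I want.
        self.model = lambda xs: [x * 2 for x in xs]

    async def __call__(self, batch):
        return self.model(batch)


handle = serve.run(BatchInference.bind())

# "Fire and forget": submit many batches but never await the responses.
# Questions 1 and 2 are about whether anything submitted this way can be
# dropped, and how much of it the deployment will buffer.
for _ in range(1000):
    handle.remote(list(range(8)))
```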