Hi,
We had an architecture in mind for one of our applications. We have quite a few models that have to be served behind an API. We could package each one as a separate Ray Serve application, but we wanted to use a single Ray cluster and run one application with an endpoint per model. Each model also requires some preprocessing models, HuggingFace transformers, and some gensim embeddings. What we did was create one deployment per model, plus a deployment that exposes the API through the FastAPI integration.

We ran into high RAM and CPU usage, which is expected, and some autoscaling issues (again expected). The big problem, though, was that some of the actors and deployments were never started, even though the cluster had the required amount of resources. At some point the system would become unresponsive and would not schedule everything.
To clarify the idea: instead of deploying several services and scaling each one via Kubernetes, we wanted to scale through Ray, so each model would be a deployment rather than a separate service.
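To make "the required amount of resources" concrete, here is a minimal sketch of the kind of check we mean. It is plain Python and does not use Ray's API or reproduce Ray's scheduler; the `Node` class, the `place_replicas` helper, the node sizes, and the per-replica requests are all hypothetical and only for illustration. It shows one way the totals can look sufficient while a replica still fits on no single node, which may or may not be what happened in our case.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Free resources on one hypothetical cluster node."""
    name: str
    cpus: float
    memory_gb: float
    placed: list = field(default_factory=list)


def place_replicas(nodes, replicas):
    """Greedy first-fit placement, largest requests first.

    `replicas` is a list of (name, cpus, memory_gb) tuples.
    Returns the names of replicas that fit on no single node.
    This is an illustrative heuristic, not Ray's scheduling policy.
    """
    unplaced = []
    ordered = sorted(replicas, key=lambda r: (r[1], r[2]), reverse=True)
    for name, cpus, mem in ordered:
        for node in nodes:
            if node.cpus >= cpus and node.memory_gb >= mem:
                # Reserve the resources on the first node with room.
                node.cpus -= cpus
                node.memory_gb -= mem
                node.placed.append(name)
                break
        else:
            # No single node had enough free CPU and memory.
            unplaced.append(name)
    return unplaced


# Made-up numbers: two 4-CPU nodes (8 CPUs in total) and three
# model replicas that together also request exactly 8 CPUs.
nodes = [Node("node-1", cpus=4, memory_gb=16), Node("node-2", cpus=4, memory_gb=16)]
replicas = [("model_a", 3, 6), ("model_b", 3, 6), ("model_c", 2, 4)]

total_requested = sum(cpus for _, cpus, _ in replicas)
total_available = sum(node.cpus for node in nodes)

unplaced = place_replicas(nodes, replicas)
print(total_requested <= total_available)  # True: the totals match
print(unplaced)  # ['model_c']: each node only has 1 CPU left
```

In this toy example the cluster has enough CPUs in aggregate, but `model_c` needs 2 CPUs on one node and each node has only 1 free after the larger replicas are placed.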
Do you think this is something that should not be attempted? Is it out of scope for Ray Serve, and for Ray in general?
Thanks for your thoughts on this.