I’m trying to deploy a few models using Ray Serve on a Ray cluster on AWS. From what I’ve gathered, creating a backend and endpoint in Ray Serve requires explicitly setting the resources each backend needs. So my questions are:
- I would like to autoscale the number of model backend replicas, and therefore the underlying resources/instances, based on demand. Ideally the scaled replicas would all read from the same worker queue. Is that possible?
- If I use Ray clusters, is it possible to assign a specific model backend to a specific type of node, so that the node types scale separately? If yes, how do I specify this? E.g., I would like model A to scale up only when demand for model A is high, and model B to scale up only when demand for B is high. (There's a rough sketch of what I have in mind after this list.)
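To make the second question concrete, here's roughly what I'm imagining with the backend/endpoint API. The custom resource names `model_a_node` / `model_b_node` are placeholders I would define on the corresponding node types in the cluster YAML (under `available_node_types` -> `resources`), and I may well be misusing the API:

```python
from ray import serve

serve.start(detached=True)

class ModelA:
    def __call__(self, request):
        return {"model": "A"}

class ModelB:
    def __call__(self, request):
        return {"model": "B"}

# Pin model A replicas to nodes that advertise the custom resource
# "model_a_node" (defined on that node type in the cluster config),
# and likewise for model B, so each node type scales on its own demand.
serve.create_backend(
    "model_a",
    ModelA,
    ray_actor_options={"num_cpus": 1, "resources": {"model_a_node": 0.01}},
    config={"num_replicas": 2},  # ideally adjusted automatically with demand for A
)
serve.create_endpoint("model_a", backend="model_a", route="/model_a")

serve.create_backend(
    "model_b",
    ModelB,
    ray_actor_options={"num_cpus": 1, "resources": {"model_b_node": 0.01}},
    config={"num_replicas": 2},  # ideally adjusted automatically with demand for B
)
serve.create_endpoint("model_b", backend="model_b", route="/model_b")
```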
The current design has three Ray backends: general processing, model A, and model B. HTTP requests hit general processing first, which then sends them to either A or B for inference, roughly as in the sketch below.
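For context, the general processing backend would look something like this. The handle usage, endpoint names (`"model_a"` / `"model_b"` from the sketch above), the `/predict` route, and the `"model"` field in the payload are just assumptions to show the shape of the routing, not code I'm confident is correct:

```python
from ray import serve

class GeneralProcessing:
    def __init__(self):
        # Async handles to the model endpoints created earlier.
        self.model_a = serve.get_handle("model_a", sync=False)
        self.model_b = serve.get_handle("model_b", sync=False)

    async def __call__(self, request):
        payload = await request.json()
        # ... shared pre-processing would happen here ...
        # Route to one of the model backends based on a field in the payload.
        if payload.get("model") == "a":
            ref = await self.model_a.remote(payload)
        else:
            ref = await self.model_b.remote(payload)
        # The async handle returns an object ref; awaiting it gives the result.
        return await ref

serve.create_backend("general_processing", GeneralProcessing)
serve.create_endpoint(
    "general_processing",
    backend="general_processing",
    route="/predict",
    methods=["POST"],
)
```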
Apologies if I’m missing anything obvious; I’m quite new to Ray and deployment ops in general. Any advice would be appreciated, thanks!