Hi team!
I'm looking to use Ray Serve to deploy a large machine learning model (~30GB) efficiently. My infrastructure is a Ray cluster running on AWS GPU instances, and I want to serve predictions on demand, ideally in real time while keeping costs down. To do that, I'd like to load the model once per replica, keep its weights resident in GPU memory, and serve incoming requests quickly after that.
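To make the question concrete, here's roughly the pattern I have in mind — just a minimal sketch assuming Ray Serve 2.x and a PyTorch checkpoint; the checkpoint path and input format are placeholders:

```python
import torch  # assuming a PyTorch model; adjust for your framework
from ray import serve
from starlette.requests import Request


@serve.deployment
class ModelServer:
    def __init__(self):
        # Runs once when the replica starts: load the 30GB model into
        # GPU memory and keep it resident, so each request only pays
        # inference cost. The path here is a placeholder.
        self.model = torch.load("/mnt/models/model.pt", map_location="cuda")
        self.model.eval()

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        with torch.no_grad():
            output = self.model(torch.tensor(payload["inputs"], device="cuda"))
        return {"prediction": output.tolist()}


app = ModelServer.bind()
serve.run(app)  # serve over HTTP on the cluster
```

Is this the right way to keep the model cached across requests, or is there a better-supported pattern?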
Additionally, I'd like to allocate roughly 20% of each GPU to serving while leaving the remaining 80% free for training tasks on the same hardware. What's the best way to accomplish this?
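For the GPU split, I'm imagining something like Ray's fractional resource requests — again just a sketch, with placeholder class bodies:

```python
import ray
from ray import serve

ray.init(address="auto")  # attach to the existing cluster


# Each serving replica reserves 0.2 of a GPU in Ray's scheduler...
@serve.deployment(ray_actor_options={"num_gpus": 0.2})
class ModelServer:
    def __init__(self):
        ...  # load the model once, as sketched above

    async def __call__(self, request):
        ...


# ...which leaves 0.8 of that GPU schedulable for a training actor.
# My understanding is that fractional GPUs in Ray are a logical
# scheduling construct only — Ray doesn't partition GPU memory — so
# I'd still have to make sure both processes fit on the device.
@ray.remote(num_gpus=0.8)
class Trainer:
    def train_step(self, batch):
        ...  # training work sharing the same physical GPU


serve.run(ModelServer.bind())
trainer = Trainer.remote()
```

Does this scheduling approach make sense, or would you recommend isolating serving and training on separate nodes instead?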