Optimizing Real-Time ML Model Serving with Ray Serve on an AWS GPU Cluster: Best Practices and Resource Allocation Strategies

Hi team!

I'm looking to use Ray Serve to deploy a large machine learning model (~30 GB) efficiently. My infrastructure is a Ray cluster running on AWS GPU instances, and I want to serve model predictions on demand, ideally in real time while keeping cost down. To achieve this, I'm looking for a strategy that loads the model once, keeps its parameters cached in memory, and then serves incoming requests with low latency.
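Here is a minimal sketch of the pattern I have in mind, where the model is loaded once in the deployment's constructor so its weights stay resident on the GPU across requests. The model path, framework (PyTorch), and request format are placeholders rather than my actual setup:

```python
import torch
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class ModelServer:
    def __init__(self):
        # Load the ~30 GB model once when the replica starts, so the
        # weights stay cached in GPU memory for all subsequent requests.
        # "/path/to/model.pt" is a placeholder for my real checkpoint.
        self.model = torch.load("/path/to/model.pt", map_location="cuda")
        self.model.eval()

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        inputs = torch.tensor(payload["inputs"], device="cuda")
        with torch.no_grad():
            outputs = self.model(inputs)
        return {"prediction": outputs.cpu().tolist()}


app = ModelServer.bind()
# serve.run(app) would attach this deployment to the existing Ray cluster.
```

Does this single-replica, load-once pattern make sense for a model of this size, or is there a better way to keep it warm between requests?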

Additionally, I'd like to allocate roughly 20% of the GPU resources to serving while using the remaining capacity for training tasks. What would be the best approach to accomplish this?
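To make the question concrete, this is roughly how I imagine expressing that split with fractional GPU requests; the 0.2/0.8 numbers and the function names are just illustrative. My understanding is that Ray's fractional `num_gpus` is a logical scheduling share rather than hard memory isolation, so I'd also appreciate guidance on whether this is a sound way to co-locate serving and training:

```python
import ray
from ray import serve


# The serving replica requests 20% of a GPU...
@serve.deployment(ray_actor_options={"num_gpus": 0.2})
class ServingDeployment:
    async def __call__(self, request) -> dict:
        # Placeholder handler; the real one would run model inference.
        return {"prediction": None}


# ...leaving 80% of the same device schedulable for training tasks.
@ray.remote(num_gpus=0.8)
def training_task():
    # Placeholder for a training step sharing the remaining GPU capacity.
    return "done"


serve.run(ServingDeployment.bind())
result = ray.get(training_task.remote())
```

Is fractional GPU allocation the recommended way to do this, or would you suggest a different resource allocation strategy (e.g. dedicating separate nodes to serving and training)?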