Optimizing Real-Time ML Model Serving with Ray Serve on an AWS GPU Cluster: Best Practices and Resource Allocation Strategies

Hi team!

I'm looking to use Ray Serve to deploy a large machine learning model (~30 GB) efficiently. My infrastructure is a Ray cluster running on AWS GPU instances, and I want to serve model predictions on demand, ideally in real time while keeping cost down. To achieve this, I'm looking for a strategy that loads the model once, keeps its parameters cached in memory, and then serves incoming requests with low latency.
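Here is a minimal sketch of the pattern I have in mind, where the model is loaded once in the deployment's constructor so its weights stay resident on the GPU across requests. The model path, framework (PyTorch), and request format are placeholders rather than my actual setup:

```python
import torch
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class ModelServer:
    def __init__(self):
        # Load the ~30 GB model once when the replica starts, so the
        # weights stay cached in GPU memory for all subsequent requests.
        # "/path/to/model.pt" is a placeholder for my real checkpoint.
        self.model = torch.load("/path/to/model.pt", map_location="cuda")
        self.model.eval()

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        inputs = torch.tensor(payload["inputs"], device="cuda")
        with torch.no_grad():
            outputs = self.model(inputs)
        return {"prediction": outputs.cpu().tolist()}


app = ModelServer.bind()
# serve.run(app) would attach this deployment to the existing Ray cluster.
```

Does this single-replica, load-once pattern make sense for a model of this size, or is there a better way to keep it warm between requests?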

Additionally, I'd like to allocate roughly 20% of the GPU resources to serving while using the remaining capacity for training tasks. What would be the best approach to accomplish this?
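To make the question concrete, this is roughly how I imagine expressing that split with fractional GPU requests; the 0.2/0.8 numbers and the function names are just illustrative. My understanding is that Ray's fractional `num_gpus` is a logical scheduling share rather than hard memory isolation, so I'd also appreciate guidance on whether this is a sound way to co-locate serving and training:

```python
import ray
from ray import serve


# The serving replica requests 20% of a GPU...
@serve.deployment(ray_actor_options={"num_gpus": 0.2})
class ServingDeployment:
    async def __call__(self, request) -> dict:
        # Placeholder handler; the real one would run model inference.
        return {"prediction": None}


# ...leaving 80% of the same device schedulable for training tasks.
@ray.remote(num_gpus=0.8)
def training_task():
    # Placeholder for a training step sharing the remaining GPU capacity.
    return "done"


serve.run(ServingDeployment.bind())
result = ray.get(training_task.remote())
```

Is fractional GPU allocation the recommended way to do this, or would you suggest a different resource allocation strategy (e.g. dedicating separate nodes to serving and training)?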