Ray Serve - Setting num_replicas > 1 errors out and not using GPU

Hi All,

I am testing out Ray Serve on a single node with a CPU and 1 GPU. From what I see, Ray Serve predictions are much slower than serving the model conventionally behind a plain FastAPI HTTP endpoint. I am assuming my configuration is not set up properly to utilize resources.

My first problem is that whenever I set num_replicas > 1, I get an error saying:

Secondly, Ray Serve is not utilizing the GPU when I run inference on my model with num_replicas = 1. I know this because I monitored GPU usage during inference. Here is my code:

import mlflow.pyfunc
import pandas as pd
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1}, route_prefix="/sentiment", num_replicas=3)
class MLflowBackend:
    def __init__(self, model_uri):
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    async def __call__(self, starlette_request):
        data = await starlette_request.body()
        data = data.decode("utf-8")
        df = pd.DataFrame([data])
        cats = self.model.predict(df)
        return cats['predictions'][0]

Hi @shamil, the reason Ray can't schedule more than 1 replica at a time is that each replica requires 1 GPU per the code's specification (specifically this snippet in @serve.deployment: ray_actor_options={"num_gpus": 1}). Instead, you can specify a fractional GPU to allow multiple replicas to share your GPU (i.e. ray_actor_options={"num_gpus": 0.33} for 3 replicas).
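To make the scheduling constraint concrete, here is a back-of-the-envelope sketch (the helper `max_replicas` is purely illustrative, not a Ray API): Ray can only place a replica while enough unclaimed GPU fraction remains, so the per-replica num_gpus value caps how many replicas fit on one GPU.

```python
# Illustrative arithmetic only: the fraction is what you would pass as
# ray_actor_options={"num_gpus": ...} in @serve.deployment.
def max_replicas(num_gpus_per_replica, total_gpus=1.0):
    """Number of replicas Ray can schedule on the available GPU fraction."""
    # Small epsilon guards against float rounding (e.g. 1 / 0.25).
    return int(total_gpus / num_gpus_per_replica + 1e-9)

print(max_replicas(1.0))   # num_gpus=1 on a 1-GPU node: only 1 replica fits,
                           # so num_replicas=3 cannot be scheduled
print(max_replicas(0.33))  # num_gpus=0.33: 3 replicas fit
```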

As for the GPU remaining unused, do you know whether your code uses a GPU without Serve? The function or class itself must use GPU resources for the GPU to be used. If the code doesn’t use a GPU without Serve, then it still won’t use it even with Serve.

Hi @shrekris ,

Thanks for your help! Indeed, setting fractional GPU usage for 3 replicas solved the problem. I do have a question: as I increase the number of replicas, my server crashes. How do I know when to stop increasing replicas? The replicas consume so much memory and/or compute that other non-Ray processes that need the GPU become starved.

No worries! Choosing the number of replicas requires a bit of experimentation. You can try different values for num_replicas and see which one works best. You could also try autoscaling, which lets you set min_replicas and max_replicas values. Watching how the autoscaler behaves might provide some insight into an efficient num_replicas setting.
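As a sketch of what such an autoscaling setup could look like (the threshold values here are illustrative, not recommendations), you pass an autoscaling_config to the deployment instead of a fixed num_replicas:

```python
# Illustrative autoscaling configuration; tune the numbers for your workload.
autoscaling_config = {
    "min_replicas": 1,   # scale down to this when idle
    "max_replicas": 3,   # never exceed what the GPU can hold
    # Target load per replica before the autoscaler adds another one:
    "target_num_ongoing_requests_per_replica": 10,
}

# It would then be attached to the deployment roughly like:
# @serve.deployment(autoscaling_config=autoscaling_config,
#                   ray_actor_options={"num_gpus": 0.33})
print(autoscaling_config)
```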

Hi @shrekris ,

Indeed, when I looked at the Ray dashboard, I saw memory usage at 86% with num_replicas = 3, meaning one more Ray actor replica would likely crash it.

Interestingly, my model inference time is the same even with num_replicas = 1. Would you suggest I stick to just 1 replica? I tested it with up to 10k data points and, despite a warning message, it works well. Results are 4x faster.

The memory crash is because each replica moves its own copy of the model onto the GPU and occupies memory there. From your description, your model's inference time is very fast, so I would say 2 or 3 replicas should be sufficient.
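Since each replica holds its own model copy in GPU memory, a rough way to bound the replica count is to divide free GPU memory by the per-model footprint. The helper and the numbers below are purely illustrative (not a Ray API):

```python
# Illustrative sizing arithmetic: each Serve replica loads a full model copy.
def replicas_that_fit(gpu_mem_gb, model_mem_gb, headroom_gb=1.0):
    """Replicas that fit on the GPU, leaving some headroom for other processes."""
    return int((gpu_mem_gb - headroom_gb) // model_mem_gb)

# e.g. a 16 GB GPU and a model that takes ~4 GB on-device:
print(replicas_that_fit(16.0, 4.0))  # 3 replicas, matching the crash at 4
```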

The correct number of replicas really depends on whether the application is CPU-bound, memory-bound, or GPU-bound. If each replica can already utilize 100% of the GPU's compute while using little CPU, then there is no point in increasing the number of replicas.