Ray Serve - Setting num_replicas > 1 errors out and not using GPU

Hi All,

I am testing out Ray Serve on a single node with a CPU and 1 GPU. From what I can see, Ray Serve predictions are much slower than serving the same model conventionally behind a plain FastAPI HTTP endpoint without Ray. I am assuming my configuration is not set up properly to utilize resources.

My first problem is that whenever I set num_replicas > 1, I get an error.

Secondly, Ray Serve is not utilizing the GPU when I run inference with num_replicas = 1. I know this because I monitor GPU usage during inference. Here is my code:

import mlflow.pyfunc
import pandas as pd
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1}, route_prefix="/sentiment", num_replicas=3)
class MLflowBackend:
    def __init__(self, model_uri):
        print("init")
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    async def __call__(self, starlette_request):
        data = await starlette_request.body()
        data = data.decode("utf-8")
        df = pd.DataFrame([data])
        cats = self.model.predict(df)
        return cats["predictions"][0]

Hi @shamil, the reason Ray can't schedule more than 1 replica at a time is that each replica requires a full GPU according to the code (specifically this snippet in @serve.deployment: ray_actor_options={"num_gpus": 1}). Instead, you can specify a fractional GPU to allow multiple replicas to share your GPU (e.g. ray_actor_options={"num_gpus": 0.33} for 3 replicas).
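
For example, keeping the rest of the deployment the same as in your snippet:

# Each replica now asks Ray for a third of a GPU, so all three replicas can
# be scheduled onto the single physical device. This is only scheduling
# bookkeeping: the replicas still share the same GPU and its memory.
@serve.deployment(
    ray_actor_options={"num_gpus": 0.33},
    route_prefix="/sentiment",
    num_replicas=3,
)
class MLflowBackend:
    ...  # same body as before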

As for the GPU remaining unused: do you know whether your code uses a GPU without Serve? Setting num_gpus only reserves the GPU for the replica; the function or class itself must actually use GPU resources for the GPU to be used. If the code doesn't use a GPU without Serve, then it still won't use it with Serve.
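
One quick way to check, assuming your MLflow pyfunc model wraps a PyTorch model (the model URI below is just a placeholder):

import mlflow.pyfunc
import pandas as pd
import torch

# Load the model outside of Serve and run one prediction.
model = mlflow.pyfunc.load_model(model_uri="models:/sentiment/1")  # placeholder URI
print("CUDA available:", torch.cuda.is_available())

df = pd.DataFrame(["this movie was great"])
model.predict(df)

# If inference actually ran on the GPU, CUDA memory will be allocated here;
# 0 means everything stayed on the CPU.
print("GPU memory allocated:", torch.cuda.memory_allocated())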

Hi @shrekris ,

Thanks for your help! Indeed, setting fractional GPU usage for 3 replicas solved the problem. I do have a question, though: as I increase the number of replicas, my server crashes. How do I know when it is a good point to stop increasing replicas? The replicas consume so much memory and/or compute that other non-Ray processes that need the GPU get starved.

No worries! Choosing the number of replicas requires a bit of experimentation. You can try different values for num_replicas and see which one works best. You could also try autoscaling, which lets you set min_replicas and max_replicas values. Watching what the autoscaler does under load might give you some insight into an efficient num_replicas setting.
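
As a rough sketch (option names can differ between Ray versions; target_num_ongoing_requests_per_replica is the knob that tells the autoscaler how many concurrent requests each replica should handle):

# Let Serve scale between 1 and 3 replicas based on per-replica load,
# instead of fixing num_replicas.
@serve.deployment(
    ray_actor_options={"num_gpus": 0.33},
    route_prefix="/sentiment",
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 3,
        "target_num_ongoing_requests_per_replica": 2,
    },
)
class MLflowBackend:
    ...  # same body as before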

Hi @shrekris ,

Indeed, when I looked at the Ray dashboard, I saw memory usage at 86% with num_replicas = 3, meaning one more Ray actor replica would likely crash the server.

Interestingly, my model inference time is the same even with num_replicas = 1. Would you suggest I stick with just 1 replica? I tested it on up to 10k data points and, despite a warning message, it works well. Results are 4x faster.

The memory crash is because each replica moves its own copy of the model onto the GPU and occupies memory there. From your description, your model's inference time is very fast, so I would say 2 or 3 replicas should be sufficient.
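
If you want a rough number instead of trial and error, you can measure how much GPU memory one copy of the model takes and divide. A minimal sketch using pynvml, assuming the model actually lands on GPU 0 (the model URI is a placeholder):

import mlflow.pyfunc
import pynvml

# Compare free GPU memory before and after loading one copy of the model.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

free_before = pynvml.nvmlDeviceGetMemoryInfo(handle).free
model = mlflow.pyfunc.load_model(model_uri="models:/sentiment/1")  # placeholder URI
free_after = pynvml.nvmlDeviceGetMemoryInfo(handle).free

per_replica = free_before - free_after  # bytes taken by one model copy
print(f"One replica uses ~{per_replica / 1e9:.2f} GB of GPU memory")
print(f"About {free_after // max(per_replica, 1)} more replicas would fit")

Keep in mind this only accounts for the model weights; activations during inference add to it, so leave some headroom.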

The correct number of replicas really depends on whether your application is CPU-bound, memory-bound, or GPU-bound. If each replica can already utilize 100% of the GPU's compute power while using little CPU, then there is no point in increasing the number of replicas.
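
One way to tell which resource you are bound on is to watch GPU utilization while sending requests to the deployment. A rough sketch (the endpoint matches the route_prefix above; the payload is just an example):

import subprocess
import threading
import time

import requests

def fire_requests(n=200):
    # Example payload against the /sentiment endpoint defined above.
    for _ in range(n):
        requests.post("http://127.0.0.1:8000/sentiment", data="this movie was great")

t = threading.Thread(target=fire_requests)
t.start()
while t.is_alive():
    # Poll GPU utilization and memory once per second while the load runs.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(out.stdout.strip())
    time.sleep(1)
t.join()

If utilization sits near 100% during the run, you are GPU-bound and more replicas won't help; if it stays low while your CPUs are busy, more (fractional-GPU) replicas can help.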