Ray Serve Model Worker Replicas Created But GPU Usage is 0% during Inference

Hi All,

Up until yesterday, I was able to see Volatile GPU-Util spike during inference on a model using the 3 replicas I created. However, all of a sudden the GPU-Util never moves from 0%, even during inference, and I am unsure what has changed. I verified the code and everything looks fine on that side. Any ideas?

Volatile GPU-Util is an instantaneous reading and mostly an estimate; see cuda - nvidia-smi Volatile GPU-Utilization explanation? - Stack Overflow. Did the inference latency spike up as well?
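
One way to rule out a sampling artifact is to poll utilization over a window instead of relying on a single reading. A minimal sketch, assuming nvidia-smi is on the PATH:

import subprocess
import time

def sample_gpu_util(seconds=10, interval=0.5):
    """Collect per-GPU utilization samples for a few seconds while inference runs."""
    samples = []
    end = time.time() + seconds
    while time.time() < end:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
            text=True,
        )
        samples.append([int(x) for x in out.strip().splitlines()])  # one value per GPU
        time.sleep(interval)
    return samples

util = sample_gpu_util()
print("max per-GPU utilization seen:", [max(col) for col in zip(*util)])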

Hi @simon-mo ,

I had the same thought. The answer is no, the inference latency did not spike up; in fact, it remained the same!
This makes me question whether I even need the GPU. When I create 3 model replicas, by default they land on 3 CPU cores and I can see their usage spike. The question is: can a model replica actor require only the GPU and not the CPU? I suspect the computation is actually being handled by the CPU, so the GPU actors are completely ignored.

Unfortunately this entirely depends on your workload. I would recommend microbenchmarking your model on CPU vs GPU to compare their latency, for example along the lines of the sketch below.
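
A rough sketch of such a microbenchmark, assuming the same MLflow model URI used in the deployments below (the URI here is a placeholder) and spaCy 3.x, where spacy.require_cpu() is available:

import time
import mlflow.pyfunc
import pandas as pd
import spacy

MODEL_URI = "models:/sentiment/Production"  # placeholder, substitute your own model URI

def mean_latency(use_gpu, n_iters=100):
    # pick the device before the model is loaded
    if use_gpu:
        spacy.prefer_gpu()   # silently falls back to CPU if no GPU is found
    else:
        spacy.require_cpu()  # spaCy 3.x only
    model = mlflow.pyfunc.load_model(model_uri=MODEL_URI)
    df = pd.DataFrame(["this is a sample sentence"])
    model.predict(df)  # warm-up so lazy initialization is not timed
    start = time.perf_counter()
    for _ in range(n_iters):
        model.predict(df)
    return (time.perf_counter() - start) / n_iters

print("CPU latency per call:", mean_latency(use_gpu=False))
print("GPU latency per call:", mean_latency(use_gpu=True))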

Correct, I figured I may not need the GPU in my case. I forced Ray Serve not to use the CPU with

ray_actor_options={"num_cpus": 0, "num_gpus": 0.33}, num_replicas=3

However, the dashboard still shows CPU usage spiking while GPU usage stays minimal. It would be nice to know why that is happening; the expected behaviour would be to see only GPU usage.

"num_cpus": 0 is only a placement constraint, not a hard utilization constraint. Your application will use the CPU regardless of this value.

Ray doesn't automatically transform your model to run on CPU vs GPU. You need to look at the model itself to decide what runs on the CPU vs the GPU. Even if the entire model lives on the GPU, there is still the cost of coordinating the GPU instructions from the CPU and of moving data between them.
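
A quick way to check is to log whether spaCy actually activated the GPU inside each replica. spacy.prefer_gpu() returns a boolean, so a small sketch like this (not your exact code) placed in the deployment's __init__ makes it visible per replica:

import spacy

gpu_active = spacy.prefer_gpu()  # True only if a GPU was found and selected
print(f"spaCy GPU activated in this replica: {gpu_active}")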

@simon-mo
Thanks, I agree about the cost of coordinating the GPU from the CPU. I ran an experiment with the same model, once calling it through the Ray Serve HTTP API and once through a deployment handle with .remote. The code uses the GPU only when I go through .remote, not when I call the model as an API endpoint. Here is the code:

Ray Serve Model API:
from ray import serve
import ray
import mlflow.pyfunc
import pandas as pd
import requests
import spacy

@serve.deployment(route_prefix="/sentiment", ray_actor_options={"num_gpus": 0.5}, num_replicas=2)
class MLflowBackend:
    def __init__(self, model_uri):
        spacy.prefer_gpu()  # ask spaCy to use the GPU if one is available
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    async def __call__(self, request):
        data = await request.body()
        df = pd.DataFrame([data])
        cats = self.model.predict(df)
        return cats['predictions'][0]

# model inference ('data' is the list of raw text payloads prepared elsewhere)
for i in data[:6000]:
    response = requests.post("http://127.0.0.1:8000/sentiment", data=i)

Ray Serve using a deployment handle (.remote):
@serve.deployment(name="sentiment", ray_actor_options={"num_gpus": 0.5}, num_replicas=2)
class MLflowBackend1:
    def __init__(self, model_uri):
        spacy.prefer_gpu()
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    def __call__(self, request):
        data = request
        df = pd.DataFrame([data])
        cats = self.model.predict(df)
        return cats['predictions'][0]

sentiment = MLflowBackend1.get_handle()
# model inference
for i in data[:10000]:
    r = sentiment.remote(i)
    ret = ray.get(r)

Overall it is only a slight change, yet the handle version uses the GPU while the HTTP endpoint version does not. Any particular reason why?

Both should use the GPU. Your second case (using handle.remote) shows GPU usage probably because sending requests directly is faster than parsing HTTP requests, so the second case drives higher load and the GPU usage actually shows up. The first case probably still uses the GPU, but because the load is smaller (lower throughput due to the HTTP parsing overhead), the GPU is never saturated and the utilization does not show up.

You can use a load testing tool (for example Locust) to increase the HTTP load and make the GPU usage visible.
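
A minimal Locust sketch for this endpoint could look like the following (the payload string is a placeholder to adapt to your data):

from locust import HttpUser, task, between

class SentimentUser(HttpUser):
    wait_time = between(0.01, 0.1)  # short think time to keep the request rate high

    @task
    def sentiment(self):
        self.client.post("/sentiment", data="this movie was great")

Run it with something like locust -f locustfile.py --host http://127.0.0.1:8000 -u 50 -r 10 and watch nvidia-smi while the load ramps up.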