Ray Serve Model Worker Replicas Created But GPU Usage is 0% during Inference

Hi All,

Up until yesterday, I was able to see Volatile GPU-Util spike during inference on a model using the 3 replicas I created. However, all of a sudden the GPU-Util never moves from 0%, even during inference, and I am unsure what has changed. I verified the code and everything looks fine on that side. Any ideas?

Volatile GPU-Util is an instantaneous reading and mostly an estimate; see cuda - nvidia-smi Volatile GPU-Utilization explanation? - Stack Overflow. Did the inference latency spike up as well?
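
One way to rule out a sampling artifact is to poll utilization over a window instead of relying on a single reading. A minimal sketch, assuming nvidia-smi is on the PATH:

import subprocess
import time

def sample_gpu_util(seconds=10, interval=0.5):
    """Collect per-GPU utilization samples for a few seconds while inference runs."""
    samples = []
    end = time.time() + seconds
    while time.time() < end:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
            text=True,
        )
        samples.append([int(x) for x in out.strip().splitlines()])  # one value per GPU
        time.sleep(interval)
    return samples

util = sample_gpu_util()
print("max per-GPU utilization seen:", [max(col) for col in zip(*util)])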

Hi @simon-mo ,

I had the same thought. The answer is no, the inference latency did not spike up; in fact, it remained the same!
This makes me question whether I even need the GPU. When I create 3 model replicas, by default they land on 3 CPU cores and I can see their usage spike. The question is: can a model replica actor require only the GPU and not the CPU? I suspect the computation is actually being handled by the CPU, so the GPU actors are completely ignored.

Unfortunately this entirely depends on your workload. I would recommend microbenchmarking your model on CPU vs GPU to compare their latency, for example along the lines of the sketch below.
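
A rough sketch of such a microbenchmark, assuming the same MLflow model URI used in the deployments below (the URI here is a placeholder) and spaCy 3.x, where spacy.require_cpu() is available:

import time
import mlflow.pyfunc
import pandas as pd
import spacy

MODEL_URI = "models:/sentiment/Production"  # placeholder, substitute your own model URI

def mean_latency(use_gpu, n_iters=100):
    # pick the device before the model is loaded
    if use_gpu:
        spacy.prefer_gpu()   # silently falls back to CPU if no GPU is found
    else:
        spacy.require_cpu()  # spaCy 3.x only
    model = mlflow.pyfunc.load_model(model_uri=MODEL_URI)
    df = pd.DataFrame(["this is a sample sentence"])
    model.predict(df)  # warm-up so lazy initialization is not timed
    start = time.perf_counter()
    for _ in range(n_iters):
        model.predict(df)
    return (time.perf_counter() - start) / n_iters

print("CPU latency per call:", mean_latency(use_gpu=False))
print("GPU latency per call:", mean_latency(use_gpu=True))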

Correct, I figured I may not need the GPU in my case. I forced Ray Serve not to use the CPU with

ray_actor_options={"num_cpus": 0, "num_gpus": 0.33}, num_replicas=3

However, the dashboard still shows CPU usage spiking while GPU usage stays minimal. It would be nice to know why that is happening; the expected behaviour would be to see only GPU usage.

"num_cpus": 0 is only a placement constraint, not a hard utilization constraint. Your application will use the CPU regardless of this value.

Ray doesn't automatically transform your model to run on CPU vs GPU. You need to look at the model itself to decide what runs on the CPU vs the GPU. Even if the entire model lives on the GPU, there is still the cost of coordinating the GPU instructions from the CPU and of moving data between them.
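
A quick way to check is to log whether spaCy actually activated the GPU inside each replica. spacy.prefer_gpu() returns a boolean, so a small sketch like this (not your exact code) placed in the deployment's __init__ makes it visible per replica:

import spacy

gpu_active = spacy.prefer_gpu()  # True only if a GPU was found and selected
print(f"spaCy GPU activated in this replica: {gpu_active}")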

@simon-mo
Thanks, I agree about the cost of coordinating the GPU from the CPU. I ran an experiment with the same model, once calling it through the Ray Serve HTTP API and once through a deployment handle with .remote. The code uses the GPU only when I go through .remote, not when I call the model as an API endpoint. Here is the code:

Ray Serve Model API:
from ray import serve
import ray
import mlflow.pyfunc
import pandas as pd
import requests
import spacy

@serve.deployment(route_prefix="/sentiment", ray_actor_options={"num_gpus": 0.5}, num_replicas=2)
class MLflowBackend:
    def __init__(self, model_uri):
        spacy.prefer_gpu()  # ask spaCy to use the GPU if one is available
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    async def __call__(self, request):
        data = await request.body()
        df = pd.DataFrame([data])
        cats = self.model.predict(df)
        return cats['predictions'][0]

# model inference ('data' is the list of raw text payloads prepared elsewhere)
for i in data[:6000]:
    response = requests.post("http://127.0.0.1:8000/sentiment", data=i)

Ray Serve using a deployment handle (.remote):
@serve.deployment(name="sentiment", ray_actor_options={"num_gpus": 0.5}, num_replicas=2)
class MLflowBackend1:
    def __init__(self, model_uri):
        spacy.prefer_gpu()
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    def __call__(self, request):
        data = request
        df = pd.DataFrame([data])
        cats = self.model.predict(df)
        return cats['predictions'][0]

sentiment = MLflowBackend1.get_handle()
# model inference
for i in data[:10000]:
    r = sentiment.remote(i)
    ret = ray.get(r)

Overall it is only a slight change, yet the handle version uses the GPU while the HTTP endpoint version does not. Any particular reason why?

Both should use the GPU. Your second case (using handle.remote) shows GPU usage probably because sending requests directly is faster than parsing HTTP requests, so the second case drives higher load and the GPU usage actually shows up. The first case probably still uses the GPU, but because the load is smaller (lower throughput due to the HTTP parsing overhead), the GPU is never saturated and the utilization does not show up.

You can use a load testing tool (for example Locust) to increase the HTTP load and make the GPU usage visible.
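
A minimal Locust sketch for this endpoint could look like the following (the payload string is a placeholder to adapt to your data):

from locust import HttpUser, task, between

class SentimentUser(HttpUser):
    wait_time = between(0.01, 0.1)  # short think time to keep the request rate high

    @task
    def sentiment(self):
        self.client.post("/sentiment", data="this movie was great")

Run it with something like locust -f locustfile.py --host http://127.0.0.1:8000 -u 50 -r 10 and watch nvidia-smi while the load ramps up.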