Understanding Ray Serve Memory Consumption

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m hosting a Ray Serve (single-node) application for an ML inference API service, and I’m getting OOM errors. I’m running it on a GCP machine (n1-standard-4) with 4 CPUs, 16GB of CPU memory, and 1 GPU with 15GB of memory. The model is a DistilBERT model (~500MB) and it’s loaded on the GPU. I run with 10 replicas, and when I send some requests I get an OOM error on the CPU side (> 0.95 memory threshold), causing the resource manager to kill actors and return a 500 Internal Server Error for the killed requests. When I check the CPU memory consumption with top, I see:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    292 xxxxxxxx  35  15   16.2g   2.0g 580528 S   0.7  7.4   8:27.45 ray::ServeRepli
    293 xxxxxxxx  35  15   16.2g   2.1g 582464 S   0.7  7.4   8:27.42 ray::ServeRepli
    294 xxxxxxxx  35  15   16.2g   2.0g 583848 S   0.7  7.4   8:27.28 ray::ServeRepli
    295 xxxxxxxx  35  15   16.1g   2.0g 584568 S   0.7  7.4   8:28.87 ray::ServeRepli
    314 xxxxxxxx  35  15   16.2g   2.0g 584668 S   0.7  7.3   8:27.19 ray::ServeRepli
    338 xxxxxxxx  35  15   16.1g   2.0g 584172 S   0.7  7.3   8:27.38 ray::ServeRepli

If the model itself is on the GPU, why is each replica taking so much CPU memory, and is there a way to reduce the CPU memory usage?

My deployment config:

  - name: MLAPI
    num_replicas: 10
    ray_actor_options:
      num_cpus: 0.3
      num_gpus: 0.1

Ray status resource usage:

 3.0/4.0 CPU
 1.0/1.0 GPU
 0B/9.47GiB memory
 44B/4.74GiB object_store_memory

Any help would be appreciated!

Changing the category to Ray Serve. cc: @Gene for thoughts

Not an expert on this, but I feel this has to do with how the model is loaded. Have you tried using model.to("cuda:0") to send the model to the GPU?
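For what it’s worth, the usual pattern looks something like this. This is a minimal sketch: the toy `nn.Sequential` stands in for DistilBERT, and the explicit `gc.collect()` / `empty_cache()` calls are an assumption about where a CPU-side copy might linger, not something the thread confirms is needed:

```python
# Sketch: load a model and move it onto the GPU inside a replica.
# Assumes PyTorch; falls back to CPU when no GPU is available.
import gc

import torch
import torch.nn as nn


def load_model_on(device: str) -> nn.Module:
    # Build (or load) the model; parameters start out on CPU.
    model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
    model = model.to(device)  # moves all parameters/buffers to the device
    model.eval()              # inference mode: no training-time bookkeeping
    gc.collect()              # drop any lingering CPU-side tensor references
    if device.startswith("cuda"):
        torch.cuda.empty_cache()  # release cached CUDA allocator blocks
    return model


device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = load_model_on(device)
```

Note that even with the weights on the GPU, each replica process still pays a fixed CPU-RAM cost just for importing torch and CUDA libraries, which may account for part of what `top` shows.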

This article might help debug where the memory is leaking: Memory Leakage with PyTorch by Raghad Alghonaim (Medium).

Also, just want to highlight that those ray_actor_options are “logical” resources: Resource Allocation — Ray 2.8.1

Ray Serve doesn’t prevent users from specifying more resources than are physically present on the machine. Users need to measure how much of each type of resource their application uses, and then set options such as num_cpus, num_gpus, memory, etc. to tell Serve not to over-schedule replicas when the cluster doesn’t have enough resources.
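To make the over-scheduling point concrete, here is a back-of-the-envelope check in plain Python. The helper `replicas_that_fit`, the 95% kill threshold, the ~2 GiB-per-replica figure from the `top` output, and the 2 GiB reserved for the OS/raylet are all illustrative assumptions drawn from this thread, not a Ray API:

```python
# Rough capacity check: logical resources let Serve place 10 replicas at
# num_cpus=0.3 on a 4-CPU box, but physical RAM is the real limit here.
def replicas_that_fit(total_ram_gib: float, per_replica_gib: float,
                      threshold: float = 0.95, reserved_gib: float = 2.0) -> int:
    """How many replicas fit before the memory monitor starts killing actors.

    reserved_gib approximates RAM used by the OS, raylet, and other processes.
    """
    budget = total_ram_gib * threshold - reserved_gib
    return max(0, int(budget // per_replica_gib))


print(replicas_that_fit(16.0, 2.0))  # → 6
```

By this rough estimate, a 16 GiB machine holds about 6 replicas at ~2 GiB each before hitting the threshold, which matches the OOM kills seen with 10 replicas.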

I’m pretty sure the model is loaded on the GPU, since nvidia-smi shows GPU memory usage consistent with the number of replicas I’ve created.

Thanks for the pointer. Yes, I read about it and I think I understand that. I guess my question was really about what exactly is happening with that 7.4% of memory per replica; from my other logs, that’s roughly 1.6GB of CPU memory per replica.

My best guess is that maybe there are some lingering references to the model in your deployment code? Or an issue with PyTorch? Maybe this thread helps: How to free CPU RAM after `module.to(cuda_device)`? (PyTorch Forums). You can also set up memory_profiler to track the memory usage.
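If installing memory_profiler is a hassle, the standard library’s tracemalloc can give a first cut. This is a sketch: the `bytearray` loop is just a stand-in for the real request-handling path, and note that tracemalloc only sees Python-level allocations, not memory held by torch’s native allocator:

```python
# Compare heap snapshots before and after a suspect code path to find
# the top allocation sites by growth (Python allocations only).
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run the suspect code path, e.g. handle a batch of requests ...
workload = [bytearray(1024) for _ in range(1000)]  # stand-in workload

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)  # top 5 allocation sites by memory growth
```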
