Understanding Ray Serve Memory Consumption

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m hosting a Ray Serve (single-node) application for an ML inference API service, and I’m getting OOM errors. I’m running it on a GCP machine (n1-standard-4) with 4 CPUs, 16GB of CPU memory, and 1 GPU with 15GB of memory. The model is a DistilBERT model (~500MB) and it’s loaded on the GPU. I run with 10 replicas, and when I send some requests I get an OOM error on the CPU side (> 0.95 memory threshold), causing the resource manager to kill actors and return a 500 Internal Server Error for the killed requests. When I check the CPU memory consumption with top, I see:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    292 xxxxxxxx  35  15   16.2g   2.0g 580528 S   0.7  7.4   8:27.45 ray::ServeRepli
    293 xxxxxxxx  35  15   16.2g   2.1g 582464 S   0.7  7.4   8:27.42 ray::ServeRepli
    294 xxxxxxxx  35  15   16.2g   2.0g 583848 S   0.7  7.4   8:27.28 ray::ServeRepli
    295 xxxxxxxx  35  15   16.1g   2.0g 584568 S   0.7  7.4   8:28.87 ray::ServeRepli
    314 xxxxxxxx  35  15   16.2g   2.0g 584668 S   0.7  7.3   8:27.19 ray::ServeRepli
    338 xxxxxxxx  35  15   16.1g   2.0g 584172 S   0.7  7.3   8:27.38 ray::ServeRepli

If the model itself is on the GPU, why is each replica taking so much CPU memory, and is there a way to reduce the CPU memory usage?

My deployment config:

  - name: MLAPI
    num_replicas: 10
    ray_actor_options:
      num_cpus: 0.3
      num_gpus: 0.1

Ray status resource usage:

 3.0/4.0 CPU
 1.0/1.0 GPU
 0B/9.47GiB memory
 44B/4.74GiB object_store_memory

Any help would be appreciated!

Changing the category to Ray Serve. cc: @Gene for thoughts

Not an expert on this, but I feel this has to do with how the model is loaded. Have you tried using model.to("cuda:0") to send the model to the GPU?
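For what it’s worth, the usual pattern looks something like this. This is a minimal sketch: the toy `nn.Sequential` stands in for DistilBERT, and the explicit `gc.collect()` / `empty_cache()` calls are an assumption about where a CPU-side copy might linger, not something the thread confirms is needed:

```python
# Sketch: load a model and move it onto the GPU inside a replica.
# Assumes PyTorch; falls back to CPU when no GPU is available.
import gc

import torch
import torch.nn as nn


def load_model_on(device: str) -> nn.Module:
    # Build (or load) the model; parameters start out on CPU.
    model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
    model = model.to(device)  # moves all parameters/buffers to the device
    model.eval()              # inference mode: no training-time bookkeeping
    gc.collect()              # drop any lingering CPU-side tensor references
    if device.startswith("cuda"):
        torch.cuda.empty_cache()  # release cached CUDA allocator blocks
    return model


device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = load_model_on(device)
```

Note that even with the weights on the GPU, each replica process still pays a fixed CPU-RAM cost just for importing torch and CUDA libraries, which may account for part of what `top` shows.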

This article might help debug where the memory is leaking: Memory Leakage with PyTorch by Raghad Alghonaim (Medium).

Also, just want to highlight that those ray_actor_options are “logical” resources: Resource Allocation — Ray 2.8.1

Ray Serve doesn’t prevent users from specifying more resources than are physically present on the machine. Users need to measure how much of each type of resource their application uses, and then set options such as num_cpus, num_gpus, memory, etc. to tell Serve not to over-schedule replicas when the cluster doesn’t have enough resources.
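To make the over-scheduling point concrete, here is a back-of-the-envelope check in plain Python. The helper `replicas_that_fit`, the 95% kill threshold, the ~2 GiB-per-replica figure from the `top` output, and the 2 GiB reserved for the OS/raylet are all illustrative assumptions drawn from this thread, not a Ray API:

```python
# Rough capacity check: logical resources let Serve place 10 replicas at
# num_cpus=0.3 on a 4-CPU box, but physical RAM is the real limit here.
def replicas_that_fit(total_ram_gib: float, per_replica_gib: float,
                      threshold: float = 0.95, reserved_gib: float = 2.0) -> int:
    """How many replicas fit before the memory monitor starts killing actors.

    reserved_gib approximates RAM used by the OS, raylet, and other processes.
    """
    budget = total_ram_gib * threshold - reserved_gib
    return max(0, int(budget // per_replica_gib))


print(replicas_that_fit(16.0, 2.0))  # → 6
```

By this rough estimate, a 16 GiB machine holds about 6 replicas at ~2 GiB each before hitting the threshold, which matches the OOM kills seen with 10 replicas.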

I’m pretty sure the model is loaded on the GPU, since nvidia-smi shows GPU memory usage consistent with the number of replicas I’ve created.

Thanks for the pointer. Yes, I read about it and I think I understand that. I guess my question was really about what exactly is happening with that 7.4% of memory per replica; from my other logs, that’s roughly 1.6GB of CPU memory per replica.

My best guess is that maybe there are some lingering references to the model in your deployment code? Or an issue with PyTorch? Maybe this thread helps: How to free CPU RAM after `module.to(cuda_device)`? (PyTorch Forums). You can also set up memory_profiler to track the memory usage.
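If installing memory_profiler is a hassle, the standard library’s tracemalloc can give a first cut. This is a sketch: the `bytearray` loop is just a stand-in for the real request-handling path, and note that tracemalloc only sees Python-level allocations, not memory held by torch’s native allocator:

```python
# Compare heap snapshots before and after a suspect code path to find
# the top allocation sites by growth (Python allocations only).
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run the suspect code path, e.g. handle a batch of requests ...
workload = [bytearray(1024) for _ in range(1000)]  # stand-in workload

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)  # top 5 allocation sites by memory growth
```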
