How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I'm hosting a Ray Serve (single-node) application for an ML inference API service, and I'm getting OOM errors. I'm running it on a GCP machine (n1-standard-4) with 4 CPUs, 16 GB of CPU memory, and 1 GPU with 15 GB of GPU memory. The model is a DistilBERT model (~500 MB) and it is loaded on the GPU. I run with 10 replicas, and when I send some requests I get an OOM error on the CPU side (memory usage above the 0.95 threshold), which causes the memory monitor to kill actors and return a 500 Internal Server Error for the killed requests. When I check memory consumption with `top`, I see:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
  292 xxxxxxxx  35  15   16.2g   2.0g 580528 S   0.7   7.4  8:27.45 ray::ServeRepli
  293 xxxxxxxx  35  15   16.2g   2.1g 582464 S   0.7   7.4  8:27.42 ray::ServeRepli
  294 xxxxxxxx  35  15   16.2g   2.0g 583848 S   0.7   7.4  8:27.28 ray::ServeRepli
  295 xxxxxxxx  35  15   16.1g   2.0g 584568 S   0.7   7.4  8:28.87 ray::ServeRepli
  314 xxxxxxxx  35  15   16.2g   2.0g 584668 S   0.7   7.3  8:27.19 ray::ServeRepli
  338 xxxxxxxx  35  15   16.1g   2.0g 584172 S   0.7   7.3  8:27.38 ray::ServeRepli
```
If the model itself is on the GPU, why is each replica taking so much CPU memory, and is there a way to reduce it?
My deployment config:
```yaml
- name: MLAPI
  num_replicas: 10
  ray_actor_options:
    num_cpus: 0.3
    num_gpus: 0.1
```
Resource usage reported by `ray status`:

```
Usage:
 3.0/4.0 CPU
 1.0/1.0 GPU
 0B/9.47GiB memory
 44B/4.74GiB object_store_memory
```
Any help would be appreciated!