Ray Serve is not managing GPU resources properly when serving multiple apps on the same GPU
I have 4 applications running in separate containers, each with its own config.yaml, but they all share a single Ray head node. In each application I set num_gpus to 1, so each replica should reserve the entire GPU. However, when I send requests to all 4 apps at the same time, all the workers come up at once and I get a CUDA out-of-memory error.
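For context, each container's main_app.py looks roughly like this. This is a minimal sketch: the class name, model id, and request payload are placeholders, but the structure (a single deployment bound as app_llm) is what the config's import_path: main_app:app_llm and deployment name llm point at.

```python
# main_app.py -- a minimal sketch of what each container serves.
# LLMDeployment, the model id, and the payload shape are placeholders.
from ray import serve


@serve.deployment(name="llm")
class LLMDeployment:
    def __init__(self):
        # Placeholder model load. This is the step that allocates GPU
        # memory, so it is where the CUDA OOM shows up when several
        # replicas end up on the same physical GPU.
        from transformers import pipeline

        self.pipe = pipeline(
            "text-classification",
            model="my-finetuned-model",  # placeholder model id
            device=0,  # the GPU Ray assigned to this replica
        )

    async def __call__(self, request):
        payload = await request.json()
        return self.pipe(payload["text"])


# The config's import_path (main_app:app_llm) resolves to this bound app.
app_llm = LLMDeployment.bind()
```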
I am running 4 LLMs in separate containers. Here is what my configs look like:
First config:

```yaml
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 12002

applications:
- name: app_llm
  import_path: main_app:app_llm
  route_prefix: /doc_cls_lmv2
  runtime_env: {}
  deployments:
  - name: llm
    max_ongoing_requests: 1
    ray_actor_options:
      num_gpus: 1.0
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 0
      max_replicas: 5
      target_ongoing_requests: 1.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      upscale_smoothing_factor: null
      downscale_smoothing_factor: null
      downscale_delay_s: 3.0
      upscale_delay_s: 0.1
```
The second, third, and fourth configs are identical except for http_options.port, which is 12003, 12004, and 12005 respectively.
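For completeness, this is roughly how I hit all four apps at once. The client below is illustrative (the payload and localhost host are placeholders), but the ports and route prefix match the configs above, and sending these four requests concurrently is what triggers all of the autoscaled replicas (min_replicas: 0) to cold-start together on the GPU.

```python
# fire_requests.py -- illustrative client; the payload is a placeholder.
import concurrent.futures

import requests

PORTS = [12002, 12003, 12004, 12005]


def hit(port: int):
    # Each app serves the same route prefix on its own HTTP port.
    url = f"http://localhost:{port}/doc_cls_lmv2"
    resp = requests.post(url, json={"text": "sample document"}, timeout=300)
    return port, resp.status_code


# One request per app, all at the same time. This is the point where
# every deployment scales from 0 to 1 replica and CUDA OOM appears.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for port, status in pool.map(hit, PORTS):
        print(port, status)
```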
Can anyone please tell me what I am doing wrong here?