Serving Multiple Applications with Ray Serve in Separate Docker Containers

Ray Serve is not managing GPU resources properly while serving multiple apps on the same GPU.

I have 4 applications running in separate containers with separate config.yaml files, but the Ray head is shared by all of them. In each application I set `num_gpus` to 1, so each replica should use the entire GPU. When I send requests to all 4 of them at the same time, all the workers come up and I get a CUDA out-of-memory error.

I am running 4 LLMs in separate containers. Here is what my configs look like:

First Config:

```yaml
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: '12002'
applications:
- name: app_llm
  import_path: main_app:app_llm
  route_prefix: /doc_cls_lmv2
  runtime_env: {}
  deployments:
  - name: llm
    max_ongoing_requests: 1
    ray_actor_options:
      num_gpus: 1.0
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 0
      max_replicas: 5
      target_ongoing_requests: 1.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      upscale_smoothing_factor: null
      downscale_smoothing_factor: null
      downscale_delay_s: 3.0
      upscale_delay_s: 0.1
```

Second Config:

```yaml
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: '12003'
applications:
- name: app_llm
  import_path: main_app:app_llm
  route_prefix: /doc_cls_lmv2
  runtime_env: {}
  deployments:
  - name: llm
    max_ongoing_requests: 1
    ray_actor_options:
      num_gpus: 1.0
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 0
      max_replicas: 5
      target_ongoing_requests: 1.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      upscale_smoothing_factor: null
      downscale_smoothing_factor: null
      downscale_delay_s: 3.0
      upscale_delay_s: 0.1
```

Third Config:

```yaml
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: '12004'
applications:
- name: app_llm
  import_path: main_app:app_llm
  route_prefix: /doc_cls_lmv2
  runtime_env: {}
  deployments:
  - name: llm
    max_ongoing_requests: 1
    ray_actor_options:
      num_gpus: 1.0
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 0
      max_replicas: 5
      target_ongoing_requests: 1.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      upscale_smoothing_factor: null
      downscale_smoothing_factor: null
      downscale_delay_s: 3.0
      upscale_delay_s: 0.1
```

Fourth Config:

```yaml
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: '12005'
applications:
- name: app_llm
  import_path: main_app:app_llm
  route_prefix: /doc_cls_lmv2
  runtime_env: {}
  deployments:
  - name: llm
    max_ongoing_requests: 1
    ray_actor_options:
      num_gpus: 1.0
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 0
      max_replicas: 5
      target_ongoing_requests: 1.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      upscale_smoothing_factor: null
      downscale_smoothing_factor: null
      downscale_delay_s: 3.0
      upscale_delay_s: 0.1
```

Can anyone please tell me what I am doing wrong here?

@cindy_zhang @Sihan_Wang

@Sam_Chan, anything you can comment on here, please?

Is each container sharing the same Docker image, or are they all different (with different lib dependencies, etc.)?

@wgetdd.deb Are these 4 different Ray clusters?
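One thing worth checking alongside the questions above: since all four containers point at the same Ray head, Ray sees a single cluster, and `num_gpus: 1.0` per replica should normally stop more than one full-GPU replica from being scheduled on a single-GPU node, so concurrent OOM suggests the GPU is being used outside Ray's resource bookkeeping. If the intent is for all four apps to share one physical GPU, one possible direction (a sketch only, assuming a single-GPU node and that all four models fit in GPU memory together; the `llm` deployment name is taken from the configs above) is fractional GPU allocation:

```yaml
# Sketch: let four replicas share one device by reserving a quarter GPU each.
# Note: num_gpus is only a scheduling hint -- Ray does not partition or cap
# GPU memory, so each model must still fit within its share of the device.
deployments:
- name: llm
  max_ongoing_requests: 1
  ray_actor_options:
    num_gpus: 0.25
```

With fractional values, Ray will pack up to four such replicas onto one GPU; whether that avoids OOM depends entirely on the combined memory footprint of the models.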
