Ray Serve is not managing GPU resources properly when serving multiple apps on the same GPU
I have 4 applications running in separate containers, each with its own config.yaml, but they all share a single Ray head node. In each application I set num_gpus to 1, so each replica should reserve the entire GPU. However, when I send requests to all 4 apps at the same time, all the workers come up at once and I get a CUDA out-of-memory error.
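For context, each container's main_app.py looks roughly like this. This is a minimal sketch: the class name, model id, and request payload are placeholders, but the structure (a single deployment bound as app_llm) is what the config's import_path: main_app:app_llm and deployment name llm point at.

```python
# main_app.py -- a minimal sketch of what each container serves.
# LLMDeployment, the model id, and the payload shape are placeholders.
from ray import serve


@serve.deployment(name="llm")
class LLMDeployment:
    def __init__(self):
        # Placeholder model load. This is the step that allocates GPU
        # memory, so it is where the CUDA OOM shows up when several
        # replicas end up on the same physical GPU.
        from transformers import pipeline

        self.pipe = pipeline(
            "text-classification",
            model="my-finetuned-model",  # placeholder model id
            device=0,  # the GPU Ray assigned to this replica
        )

    async def __call__(self, request):
        payload = await request.json()
        return self.pipe(payload["text"])


# The config's import_path (main_app:app_llm) resolves to this bound app.
app_llm = LLMDeployment.bind()
```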
I am running 4 LLMs in separate containers. Here is what my configs look like:
First config:

```yaml
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 12002

applications:
- name: app_llm
  import_path: main_app:app_llm
  route_prefix: /doc_cls_lmv2
  runtime_env: {}
  deployments:
  - name: llm
    max_ongoing_requests: 1
    ray_actor_options:
      num_gpus: 1.0
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 0
      max_replicas: 5
      target_ongoing_requests: 1.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      upscale_smoothing_factor: null
      downscale_smoothing_factor: null
      downscale_delay_s: 3.0
      upscale_delay_s: 0.1
```
The second, third, and fourth configs are identical except for http_options.port, which is 12003, 12004, and 12005 respectively.
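For completeness, this is roughly how I hit all four apps at once. The client below is illustrative (the payload and localhost host are placeholders), but the ports and route prefix match the configs above, and sending these four requests concurrently is what triggers all of the autoscaled replicas (min_replicas: 0) to cold-start together on the GPU.

```python
# fire_requests.py -- illustrative client; the payload is a placeholder.
import concurrent.futures

import requests

PORTS = [12002, 12003, 12004, 12005]


def hit(port: int):
    # Each app serves the same route prefix on its own HTTP port.
    url = f"http://localhost:{port}/doc_cls_lmv2"
    resp = requests.post(url, json={"text": "sample document"}, timeout=300)
    return port, resp.status_code


# One request per app, all at the same time. This is the point where
# every deployment scales from 0 to 1 replica and CUDA OOM appears.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for port, status in pool.map(hit, PORTS):
        print(port, status)
```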
Can anyone please tell me what I am doing wrong here?