Hi!
I was wondering if there's a way to have multiple models each use tensor parallelism on the same multi-GPU instance.
For example, I have a simple 4x RTX 5090 setup and I'd like to spread the two models below over the 4 GPUs; there should be enough GPU capacity for that:
applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: modelA
        model_source: some-11B-model
      engine_kwargs:
        tensor_parallel_size: 4
        dtype: "auto"
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
    - model_loading_config:
        model_id: llama-70b
        model_source: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16
      engine_kwargs:
        tensor_parallel_size: 4
        max_model_len: 8192
        dtype: "auto"
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
However, this won't schedule. It seems that each replica requests GPU: 1 per tensor-parallel worker, i.e. tensor_parallel_size GPUs per model, which amounts to 8 GPUs in total, so one model deploys and the other stays pending.
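To make the numbers concrete, here is a rough sketch of the resource math as I understand it (assuming each tensor-parallel worker requests one GPU, which matches the GPU: 1 entries I see in the resource requests):

```python
# Sketch of the scheduling arithmetic (assumption: one GPU per TP worker).
tensor_parallel_size = 4
gpus_per_replica = 1 * tensor_parallel_size    # 4 GPUs per model replica
num_models = 2                                 # modelA + llama-70b, min_replicas=1 each
total_gpus_requested = num_models * gpus_per_replica
gpus_available = 4                             # 4x RTX 5090

print(total_gpus_requested, gpus_available)    # 8 vs 4 -> second model stays pending
```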
Is there a way to make two or more models share the same GPUs while still using tensor_parallel_size?