Hi!
I was wondering if there's a way to have multiple models each use tensor parallelism on the same multi-GPU instance.
For example, I have a simple 4x RTX 5090 setup and I'd like to spread the two models below over the 4 GPUs; there should be enough GPU capacity for that:
applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: modelA
        model_source: some-11B-model
      engine_kwargs:
        tensor_parallel_size: 4
        dtype: "auto"
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
    - model_loading_config:
        model_id: llama-70b
        model_source: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16
      engine_kwargs:
        tensor_parallel_size: 4
        max_model_len: 8192
        dtype: "auto"
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
However, this won't schedule. It seems that each replica requests GPU: 1 per tensor-parallel worker, i.e. tensor_parallel_size GPUs per model, which amounts to 8 GPUs in total, so one model deploys and the other stays pending.
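To make the numbers concrete, here is a rough sketch of the resource math as I understand it (assuming each tensor-parallel worker requests one GPU, which matches the GPU: 1 entries I see in the resource requests):

```python
# Sketch of the scheduling arithmetic (assumption: one GPU per TP worker).
tensor_parallel_size = 4
gpus_per_replica = 1 * tensor_parallel_size    # 4 GPUs per model replica
num_models = 2                                 # modelA + llama-70b, min_replicas=1 each
total_gpus_requested = num_models * gpus_per_replica
gpus_available = 4                             # 4x RTX 5090

print(total_gpus_requested, gpus_available)    # 8 vs 4 -> second model stays pending
```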
Is there a way to make two or more models share the same GPUs while still using tensor_parallel_size?