Ray Serve vLLM: multiple models per GPU with tensor parallelism

Hi!
I was wondering if there’s a way to have multiple models use tensor parallelism across the GPUs of a multi-GPU instance.
For example, I have a simple 4x RTX 5090 setup and I’d like to spread the two models below over the 4 GPUs; there should be enough GPU capacity for that:

applications:
  - args:
      llm_configs:
      - model_loading_config:
          model_id: modelA
          model_source: some-11B-model
        engine_kwargs:
          tensor_parallel_size: 4
          dtype: "auto"
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 1
      - model_loading_config:
          model_id: llama-70b
          model_source: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16
        engine_kwargs:
          tensor_parallel_size: 4
          max_model_len: 8192
          dtype: "auto"
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 1
    import_path: ray.serve.llm:build_openai_app
    name: llm_app
    route_prefix: "/"

However, this won’t schedule. Each time I specify the tensor parallel argument, I can see the deployment request GPU: 1 times tensor_parallel_size, which amounts to 8 GPUs across the two models, so one model deploys and the other stays pending.
Is there a way to make two or more models share the GPUs while using tensor_parallel_size?

In your deployment above you are specifying that each replica of each model needs 4 GPUs, so with only 4 GPUs on the node, not being able to schedule both makes sense.
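As a sketch (not something I have tested on your hardware): if each model’s weights and KV cache fit in the combined memory of two RTX 5090s, you could drop tensor_parallel_size to 2 for both models so they land on disjoint pairs of GPUs and the total request stays at 4:

applications:
  - args:
      llm_configs:
      - model_loading_config:
          model_id: modelA
          model_source: some-11B-model
        engine_kwargs:
          tensor_parallel_size: 2   # 2 GPUs for this model
          dtype: "auto"
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 1
      - model_loading_config:
          model_id: llama-70b
          model_source: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16
        engine_kwargs:
          tensor_parallel_size: 2   # the remaining 2 GPUs
          max_model_len: 8192
          dtype: "auto"
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 1
    import_path: ray.serve.llm:build_openai_app
    name: llm_app
    route_prefix: "/"

Note this is not GPU sharing: each model still owns whole GPUs, just fewer of them, so the total GPU request fits the node.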

I don’t know of any way to share GPUs between two models yet. This goes back to fractional GPU support, which is a problem that is not prioritized right now, but if you are interested you can drive that. Basically, one exercise we should do is to see whether we can instantiate and query two LLMs with two actors that each take half a GPU. I think the main challenges of doing that will show themselves if we do this exercise.
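If anyone wants to pick that exercise up, a minimal sketch of the idea (untested; the model names are placeholders and gpu_memory_utilization would need tuning) is two Ray actors that each reserve num_gpus=0.5 and start their own vLLM engine:

# Sketch of the "two actors on half a GPU each" exercise.
# Assumes vLLM is installed; model names are placeholders and both
# models must be small enough to fit in half of one GPU's memory.
import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=0.5)  # fractional GPUs are only scheduling accounting, not isolation
class LLMActor:
    def __init__(self, model: str):
        # Lower gpu_memory_utilization so the two engines don't both
        # try to claim ~90% of the same GPU's memory.
        self.llm = LLM(model=model, gpu_memory_utilization=0.45)

    def generate(self, prompt: str) -> str:
        out = self.llm.generate([prompt], SamplingParams(max_tokens=64))
        return out[0].outputs[0].text

ray.init()
a = LLMActor.remote("facebook/opt-125m")   # placeholder model
b = LLMActor.remote("Qwen/Qwen2.5-0.5B")   # placeholder model
print(ray.get(a.generate.remote("Hello from actor A")))
print(ray.get(b.generate.remote("Hello from actor B")))

Keep in mind that Ray’s fractional GPU accounting only affects scheduling; it doesn’t enforce memory isolation between the two engines, which is probably where the challenges mentioned above will show up.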