Multiple deployments on a Ray Service get stuck on pending placement groups

Hello, I'm trying to deploy a single RayService on AKS with multiple deployments defined through LLMConfig, but the worker always stays pending because Ray tries to schedule more than 1 GPU:
```
{'CPU': 2.0, 'GPU': 0.1}: 1+ pending tasks/actors (1+ using placement groups)
{'CPU': 12.0, 'GPU': 0.5}: 1+ pending tasks/actors (1+ using placement groups)
{'GPU': 1.1, 'CPU': 2.0} * 1 (PACK): 2+ pending placement groups
```

I tried different values in ray_actor_options (num_gpus was 0.9 for the LLM model when I first hit this issue, then I reduced it to 0.5, but nothing changed), and I also reduced the model length as well as the model size (moving from 32B to 8B). However, there is always a pending placement group that adds up to more than 1 GPU, so everything stays stuck.
I am able to deploy a single deployment.

Versions / Dependencies

stock rayproject/ray-llm:2.46.0-py311-cu124

Ray 2.46.0
py311
cu124

Differences in libraries from the original image:
"vllm>=0.8.5", "transformers>=4.56.0"

Hardware:
A100 node pool on AKS (Azure)

Reproduction script

This is my Python serve script:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# =========================
# Qwen3 8B Chat Model
# =========================
chat_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-8B",
    },
    engine_kwargs={
        "max_model_len": 8000,            # full long context
        "dtype": "bfloat16",
        "gpu_memory_utilization": 0.5,    # use 50% of A100 GPU memory
        "trust_remote_code": True,
        "enable_auto_tool_choice": True,  # enables automatic tool usage
        "tool_call_parser": "hermes",     # for function/tool-call reasoning
    },
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.5,
            "num_cpus": 12,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Qwen3 0.6B Embedding Model
# =========================
embedding_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-Embedding-0.6B",
    },
    engine_kwargs={
        "max_model_len": 1000,
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "task": "embed",
    },
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.1,
            "num_cpus": 2,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Build one OpenAI-compatible app
# =========================
llm_app = build_openai_app({
    "llm_configs": [chat_llm, embedding_llm],
})
```
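(The module here is `serve_qwen3_openai_app.py`, placed under `/serve_app` in the image, so that the `PYTHONPATH=/serve_app` env var and the `import_path: serve_qwen3_openai_app:llm_app` in the RayService spec below resolve to it.)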

This is the ray status I get when I deploy the pod:

```
ray status

Active:
 1 headgroup
Idle:
 1 gpu-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources

Total Usage:
 1.0/24.0 CPU
 0.0/1.0 GPU
 0B/37.31GiB memory
 0B/11.84GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 {'CPU': 2.0, 'GPU': 0.1}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 12.0, 'GPU': 0.5}: 1+ pending tasks/actors (1+ using placement groups)
 {'GPU': 1.1, 'CPU': 2.0} * 1 (PACK): 2+ pending placement groups
```

My YAML config file:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-qwen3-openai-llm-embed
spec:
  serveConfigV2: |
    applications:
    - name: qwen3
      import_path: serve_qwen3_openai_app:llm_app
      route_prefix: "/"
      deployments:
      # --- Deployment 1: The Chat/LLM Model ---
      - name: Qwen3-Chat
        # We explicitly define the resources needed for this deployment
        ray_actor_options:
          num_gpus: 0.6
          num_cpus: 12

      # --- Deployment 2: The Embedder Model ---
      - name: EmbeddingService
        num_replicas: 1
        ray_actor_options:
          num_gpus: 0.1
          num_cpus: 2

  rayClusterConfig:
    rayVersion: "2.46.0"

    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            ray.io/disable-probes: "true"   # ✅ Prevent operator from overwriting probes
        spec:
          containers:
          - name: ray-head
            image: <container_registry>/ray-qwen3-llm-embed-openai:latest
            env:
            - name: PYTHONPATH
              value: /serve_app
            command: ["/bin/bash", "-c"]
            args:
              - |
                ray start --head --dashboard-host=0.0.0.0 --port=6379 && \
                serve run serve_qwen3_openai_app:llm_app
            resources:
              limits:
                cpu: 4
                memory: 8Gi
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265   # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            # Dummy probes (won't be used if the annotation disables them)
            livenessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo live"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120
            readinessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo ready"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120

    workerGroupSpecs:
    - groupName: gpu-group
      replicas: 1
      rayStartParams:
        num-gpus: "1"
        # resources: '{"accelerator_type:A100": 1}'
      template:
        metadata:
          annotations:
            ray.io/disable-probes: "true"   # ✅ Disable probes for the worker too
        spec:
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "present"
            effect: "NoSchedule"
          containers:
          - name: ray-worker
            image: <container_registry>/ray-qwen3-llm-embed-openai:latest
            env:
            - name: PYTHONPATH
              value: /serve_app
            resources:
              limits:
                nvidia.com/gpu: "1"
                cpu: 20
                memory: 32Gi
            # Dummy probes (won't be active due to the annotation)
            livenessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo live"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120
            readinessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo ready"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120
```

Your issue is due to Ray Serve/vLLM's placement group logic. When you deploy multiple LLMConfig-based deployments with fractional GPU requests (e.g., 0.5 and 0.1), Ray's placement group sums the GPU requirements of all deployments in the group, so the total GPU demand (e.g., 0.6 + 0.1 = 0.7, which the internal logic may round up or pad to more than 1 GPU for placement) cannot be satisfied by a single 1-GPU node. This is a known limitation: fractional GPU allocation across multiple deployments often leads to infeasible placement groups and pending actors, because Ray ends up requesting more than 1 GPU per placement group even though each deployment individually asks for less than 1 GPU. This is confirmed in the Ray issue tracker, where users report that Ray Serve LLMConfig deployments with fractional GPUs always result in pending placement groups and unschedulable actors, regardless of the specific fractional values used (Rayserve LLMConfig not working for fractional resource allocation).
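You can see this concretely on the cluster by listing the placement groups Serve created and their bundles: each bundle must fit entirely on a single node, so the `{'GPU': 1.1, 'CPU': 2.0}` bundle from your `ray status` output can never be placed on a node that exposes only 1 GPU. A minimal diagnostic sketch (assuming it is run from a Python shell inside the head pod; the exact bundle shapes on your cluster may differ):

```python
# Minimal diagnostic sketch: dump every placement group and its bundles.
# Assumes this runs on the cluster itself (e.g. `kubectl exec` into the head pod).
import ray
from ray.util import placement_group_table

ray.init(address="auto")  # attach to the already-running Ray cluster

for pg_id, info in placement_group_table().items():
    # Each bundle must fit entirely on ONE node, so a bundle requesting more
    # GPU than any single node offers (here 1.1 > 1.0) stays PENDING forever.
    print(pg_id, info["state"], info["strategy"], info["bundles"])
```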

There is currently no supported workaround in Ray 2.46.0 for running multiple LLMConfig deployments with fractional GPU allocation on a single-GPU node. The only ways to avoid this are to deploy each model on a separate node with at least 1 GPU, or to use whole-GPU allocations (num_gpus: 1) per deployment. Would you like a step-by-step breakdown of the placement group logic and possible future workarounds?
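For reference, a minimal sketch of the whole-GPU variant of your serve script (your own config with only `num_gpus` changed; note it only becomes schedulable if the cluster exposes at least 2 GPUs in total, e.g. two 1-GPU worker nodes or one 2-GPU node):

```python
# Sketch of the whole-GPU workaround described above: each deployment gets a
# full GPU, so no fractional placement-group bundle is generated.
# NOTE: this needs >= 2 GPUs in the cluster; it will NOT fit the single-A100
# worker group from the original YAML.
from ray.serve.llm import LLMConfig, build_openai_app

chat_llm = LLMConfig(
    model_loading_config={"model_id": "Qwen/Qwen3-8B"},
    engine_kwargs={"max_model_len": 8000, "dtype": "bfloat16"},
    deployment_config={
        "ray_actor_options": {"num_gpus": 1, "num_cpus": 12},  # whole GPU
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 1},
    },
)

embedding_llm = LLMConfig(
    model_loading_config={"model_id": "Qwen/Qwen3-Embedding-0.6B"},
    engine_kwargs={"max_model_len": 1000, "dtype": "bfloat16", "task": "embed"},
    deployment_config={
        "ray_actor_options": {"num_gpus": 1, "num_cpus": 2},  # whole GPU
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 1},
    },
)

llm_app = build_openai_app({"llm_configs": [chat_llm, embedding_llm]})
```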

Hint: Mention @RunLLM in the post for followups.