Hello, I deployed a custom Docker image that serves two models (an LLM and an embedding model).
I'm able to deploy the RayService, but both the head and worker pods never become ready and keep restarting.
This is my custom Dockerfile:
FROM rayproject/ray-llm:2.46.0-py311-cu124
WORKDIR /serve_app
COPY serve_qwen3_openai_app.py /serve_app/serve_qwen3_openai_app.py
RUN pip uninstall -y vllm || true
RUN pip install "vllm>=0.8.5" "transformers>=4.56.0"
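To confirm the upgrade actually lands in the image, I'm also thinking of adding a build-time sanity check (just a sketch; it only fails the build if the imports break and prints whatever versions pip resolved):
# Sanity check (sketch): fail the build if vllm/transformers can't import, and log the resolved versions
RUN python -c "import vllm, transformers; print('vllm', vllm.__version__, 'transformers', transformers.__version__)"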
This is my serve_qwen3_openai_app.py
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# =========================
# Qwen3 32B Chat Model
# =========================
chat_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-32B",
    },
    engine_kwargs={
        "max_model_len": 30000,  # full long context
        "gpu_memory_utilization": 0.90,  # use 90% of A100 GPU memory
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "enable_auto_tool_choice": True,  # enables automatic tool usage
        "tool_call_parser": "hermes",  # for function/tool-call reasoning
    },
    accelerator_type="A100",
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Qwen3 0.6B Embedding Model
# =========================
embedding_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-Embedding-0.6B",
    },
    engine_kwargs={
        "max_model_len": 2048,
        "gpu_memory_utilization": 0.10,  # small model → lightweight on GPU
        "dtype": "bfloat16",
        "trust_remote_code": True,
    },
    accelerator_type="A100",
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Build one OpenAI-compatible app
# =========================
llm_app = build_openai_app({
    "llm_configs": [chat_llm, embedding_llm]
})

# Deploy it
serve.run(llm_app)
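Once it comes up healthy, my plan is to test it through the serve service that RayService creates (the service name and port below are assumptions based on my cluster; I haven't gotten that far yet):
# Forward the Serve HTTP port (assuming the default serve service name and port 8000)
kubectl port-forward svc/ray-qwen3-openai-serve-svc 8000:8000
# List the registered models
curl http://localhost:8000/v1/models
# Chat completion against the 32B model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-32B", "messages": [{"role": "user", "content": "hello"}]}'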
This is my RayService YAML file:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-qwen3-openai
spec:
  serveConfigV2: |
    applications:
      - name: qwen3
        import_path: serve_qwen3_openai_app:llm_app
        route_prefix: "/"
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: acrservice.azurecr.io/ray-qwen3-openai:vllm085
              resources:
                limits:
                  cpu: 4
                  memory: 8Gi
    workerGroupSpecs:
      - groupName: gpu-group
        replicas: 1
        rayStartParams:
          num-gpus: "1"
        template:
          spec:
            tolerations:
              - key: "nvidia.com/gpu"
                operator: "Equal"
                value: "present"
                effect: "NoSchedule"
            containers:
              - name: ray-worker
                image: acrservice.azurecr.io/ray-qwen3-openai:vllm085
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    cpu: 20
                    memory: 32Gi
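One thing I'm considering (not sure if it's the right fix) is giving the head container more lenient probes so it isn't restarted while everything is still starting up. My understanding is that KubeRay only injects its default probes when none are defined, and the port/path below are my own assumptions about what those defaults check:
          containers:
            - name: ray-head
              image: acrservice.azurecr.io/ray-qwen3-openai:vllm085
              readinessProbe:
                httpGet:
                  path: /api/local_raylet_healthz  # assumed dashboard-agent health endpoint
                  port: 52365                      # assumed dashboard-agent port
                initialDelaySeconds: 60
                periodSeconds: 10
                failureThreshold: 60
              livenessProbe:
                httpGet:
                  path: /api/local_raylet_healthz
                  port: 52365
                initialDelaySeconds: 120
                periodSeconds: 10
                failureThreshold: 60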
Right now, using kubectl get pods -A -w, I see the two pods like this:
default ray-qwen3-openai-js2w5-gpu-group-worker-jswfk 0/1 Running 4 (5m47s ago) 48m
default ray-qwen3-openai-js2w5-head-nbpzc 0/1 Running 4 (6m7s ago) 48m
If I describe the head pod, I see this:
root@2e11e7204b23:/workspaces/iaac_azure/custom_ray_serve_app# kubectl describe pod ray-qwen3-openai-v5nw9-head-l8gsh | tail -n 20
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 20m default-scheduler Successfully assigned default/ray-qwen3-openai-v5nw9-head-l8gsh to aks-system-50293565-vmss00000c
Normal Killing 10m kubelet Container ray-head failed liveness probe, will be restarted
Normal Pulled 10m (x2 over 20m) kubelet Container image "acrservice.azurecr.io/ray-qwen3-openai:vllm085" already present on machine
Normal Created 10m (x2 over 20m) kubelet Created container: ray-head
Normal Started 10m (x2 over 20m) kubelet Started container ray-head
Warning Unhealthy 5m43s (x186 over 20m) kubelet Readiness probe failed:
Warning Unhealthy 42s (x231 over 20m) kubelet Liveness probe failed:
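To see what the container was doing right before each restart, I was also going to pull the previous container's logs:
# Logs from the last (killed) container instance of the head pod
kubectl logs ray-qwen3-openai-v5nw9-head-l8gsh --previous | tail -n 100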
So it looks like the pod gets restarted roughly every 10 minutes. Could this be related to the model downloads taking a long time, so the readiness/liveness probes keep failing until the models are loaded?
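To check that, I was planning to exec into the head pod and look at the Serve side directly, along these lines (the label selector and log paths are from memory, so treat them as assumptions):
# Grab the head pod (KubeRay labels it with ray.io/node-type=head, as far as I know)
HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o name | head -n 1)
# Application/deployment status as Serve sees it
kubectl exec -it $HEAD_POD -- serve status
# Serve controller logs (I'd also check the replica logs in the same directory)
kubectl exec -it $HEAD_POD -- bash -c 'tail -n 100 /tmp/ray/session_latest/logs/serve/controller_*.log'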
Did I do something wrong?