Ray pods keep restarting

Hello, I deployed a custom Docker image that serves 2 models (an LLM and an embedding model).

I’m able to deploy the Ray service, however both the head and worker pods never become ready and keep restarting.

This is my custom Dockerfile:

FROM rayproject/ray-llm:2.46.0-py311-cu124

WORKDIR /serve_app
COPY serve_qwen3_openai_app.py /serve_app/serve_qwen3_openai_app.py

RUN pip uninstall -y vllm || true
RUN pip install "vllm>=0.8.5" "transformers>=4.56.0"

This is my serve_qwen3_openai_app.py:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# =========================
# Qwen3 32B Chat Model
# =========================
chat_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-32B",
    },
    engine_kwargs={
        "max_model_len": 30000,               # full long context
        "gpu_memory_utilization": 0.90,       # use 90% of A100 GPU memory
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "enable_auto_tool_choice": True,      # enables automatic tool usage
        "tool_call_parser": "hermes",         # for function/tool-call reasoning
    },
    accelerator_type="A100",
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Qwen3 0.6B Embedding Model
# =========================
embedding_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-Embedding-0.6B",
    },
    engine_kwargs={
        "max_model_len": 2048,
        "gpu_memory_utilization": 0.10,       # small model → lightweight on GPU
        "dtype": "bfloat16",
        "trust_remote_code": True,
    },
    accelerator_type="A100",
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Build one OpenAI-compatible app
# =========================
llm_app = build_openai_app({
    "llm_configs": [chat_llm, embedding_llm]
})

# Deploy it
serve.run(llm_app)
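
(For reference, the app built above exposes the standard OpenAI-compatible routes; below is a minimal client-side sketch, assuming Serve's HTTP port 8000 is forwarded to localhost, the openai client package is installed locally, and no API key is enforced. The model names are just the model_ids declared above.)

# Client-side sketch (assumptions: port 8000 forwarded to localhost,
# openai package installed, no auth; model names = model_ids above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Chat completion against the 32B chat model
chat = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)

# Embedding from the 0.6B embedding model
emb = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input="Ray Serve LLM test sentence",
)
print(len(emb.data[0].embedding))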

This is my RayService YAML file:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-qwen3-openai
spec:
  serveConfigV2: |
    applications:
    - name: qwen3
      import_path: serve_qwen3_openai_app:llm_app
      route_prefix: "/"
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      template:
        spec:
          containers:
          - name: ray-head
            image: acrservice.azurecr.io/ray-qwen3-openai:vllm085
            resources:
              limits:
                cpu: 4
                memory: 8Gi
    workerGroupSpecs:
    - groupName: gpu-group
      replicas: 1
      rayStartParams:
        num-gpus: "1"
      template:
        spec:
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "present"
            effect: "NoSchedule"
          containers:
          - name: ray-worker
            image: acrservice.azurecr.io/ray-qwen3-openai:vllm085
            resources:
              limits:
                nvidia.com/gpu: "1"
                cpu: 20
                memory: 32Gi

Using the command:

kubectl get pods -A -w

I see the two pods like this:

default ray-qwen3-openai-js2w5-gpu-group-worker-jswfk 0/1 Running 4 (5m47s ago) 48m
default ray-qwen3-openai-js2w5-head-nbpzc 0/1 Running 4 (6m7s ago) 48m

If I describe the head pod, I see this:

root@2e11e7204b23:/workspaces/iaac_azure/custom_ray_serve_app# kubectl describe pod ray-qwen3-openai-v5nw9-head-l8gsh | tail -n 20
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  20m                    default-scheduler  Successfully assigned default/ray-qwen3-openai-v5nw9-head-l8gsh to aks-system-50293565-vmss00000c
  Normal   Killing    10m                    kubelet            Container ray-head failed liveness probe, will be restarted
  Normal   Pulled     10m (x2 over 20m)      kubelet            Container image "acrservice.azurecr.io/ray-qwen3-openai:vllm085" already present on machine
  Normal   Created    10m (x2 over 20m)      kubelet            Created container: ray-head
  Normal   Started    10m (x2 over 20m)      kubelet            Started container ray-head
  Warning  Unhealthy  5m43s (x186 over 20m)  kubelet            Readiness probe failed:
  Warning  Unhealthy  42s (x231 over 20m)    kubelet            Liveness probe failed:

So it seems that after about 10 minutes the container restarts. Could this be because downloading the models takes so long that the readiness/liveness probes never succeed?

Did I do something wrong?

Yes, your issue is likely due to the Ray Serve deployment taking too long to initialize (e.g., downloading large models), causing the Kubernetes liveness/readiness probes to fail and the pods to restart. This is a common problem when loading large LLMs, as the default probe timeouts may be too short for the model download and initialization process to complete. According to RayService troubleshooting docs, if the application takes too long to become ready, Kubernetes may repeatedly restart the pods, preventing successful deployment.

To resolve this, you should increase the initialDelaySeconds and timeoutSeconds for the liveness and readiness probes in your pod spec, giving the model enough time to download and initialize. This is not a Ray-specific issue but a Kubernetes configuration detail. Also, ensure your cluster has enough resources and network bandwidth for model downloads, as insufficient resources can also cause repeated restarts (Discourse: deployment init takes too long).

Would you like more detail on how to adjust your Kubernetes YAML to fix this?

@RunLLM what should I change to relax the readiness probe checks and give the pod time to download the model weights?

It seems likely that you should increase the initialDelaySeconds and timeoutSeconds for the readiness and liveness probes in your RayService YAML under the pod spec for both head and worker containers. This gives your container more time to download and load the models before Kubernetes checks for readiness, preventing premature restarts. You can also temporarily remove or comment out the readiness/liveness probes to disable them, but increasing the timeouts is the recommended approach for production (RayService troubleshooting guide).

For example, add or modify the following under your container spec:

readinessProbe:
  httpGet:
    path: /-/healthz
    port: 8000
  initialDelaySeconds: 600  # Increase as needed (e.g., 10 minutes)
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 60
livenessProbe:
  httpGet:
    path: /-/healthz
    port: 8000
  initialDelaySeconds: 600
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 60

Would you like more detail or a full YAML example?

@RunLLM

I fixed the problem using a similar approach; now I see the following when I get the pods:

default ray-qwen3-openai-xx58q-gpu-group-worker-gbrrj 1/1 Running 0 6h11m
default ray-qwen3-openai-xx58q-head-4rz87 1/1 Running 0 6h11m

However, when I check the Ray cluster status:
NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE
ray-qwen3-openai-xx58q 1 1 24 40Gi 1 ready 13h

ray status from a shell on the head node:

(base) ray@ray-qwen3-openai-xx58q-head-4rz87:/serve_app$ ray status
======== Autoscaler status: 2025-10-26 14:31:51.790006 ========
Node status
---------------------------------------------------------------
Active:
 1 headgroup
 1 gpu-group
Idle:
 (no idle nodes)
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

However, when I do a port-forward, I get the following and no endpoint responds:

kubectl port-forward svc/ray-qwen3-openai-xx58q-head-svc 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Handling connection for 8000
Handling connection for 8000

What can I do?

Hi @Antonio_B, thanks for posting! Can you exec into the head pod and run serve status? That will give us more clarity into what’s actually going wrong in the code.
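
If it's easier, roughly the same information is also available from the Python API; here is a minimal sketch you could run in a Python shell inside the head pod (assuming the cluster and the Serve app have already been started there):

# Sketch: inspect Serve's view of the application from inside the head pod.
import ray
from ray import serve

ray.init(address="auto")   # attach to the running Ray cluster
print(serve.status())      # per-application / per-deployment statuses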

Hello, I already posted the ray status from inside the pod, as you can see above. I have since fixed this issue, and my pods start successfully.

Now I’m trying to deploy two models in one RayService using fractional GPUs, however the deployments never get the resources they request.

This is my situation:

My worker resources spec in the YAML file:

resources:
  limits:
    nvidia.com/gpu: "1"
    cpu: 20
    memory: 32Gi

My head resources in the YAML file:

resources:
  limits:
    cpu: 4
    memory: 8Gi

My Python file for the multiple-model deployment:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# =========================
# Qwen3 32B Chat Model
# =========================
chat_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-32B",
    },
    engine_kwargs={
        "max_model_len": 10000,               # reduced context length
        "dtype": "bfloat16",
        "gpu_memory_utilization": 0.90,       # use 90% of A100 GPU memory
        "trust_remote_code": True,
        "enable_auto_tool_choice": True,      # enables automatic tool usage
        "tool_call_parser": "hermes",         # for function/tool-call reasoning
    },
    # accelerator_type="A100",
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.7,
            "num_cpus": 12,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Qwen3 0.6B Embedding Model
# =========================
embedding_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-Embedding-0.6B",
    },
    engine_kwargs={
        "max_model_len": 1000,
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "task": "embed",
    },
    # accelerator_type="A100",
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.1,
            "num_cpus": 2,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Build one OpenAI-compatible app
# =========================
llm_app = build_openai_app({
    "llm_configs": [chat_llm, embedding_llm]
})

# Deploy it
serve.run(llm_app)

I have both the worker and the head running, however when I run ray status on my head node this is the situation:

======== Autoscaler status: 2025-10-29 23:29:48.092851 ========
Node status
---------------------------------------------------------------
Active:
 1 headgroup
Idle:
 1 gpu-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 1.0/24.0 CPU
 0.0/1.0 GPU
 0B/37.32GiB memory
 0B/11.84GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 {'CPU': 12.0, 'GPU': 0.8}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 2.0, 'GPU': 0.1}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 2.0, 'GPU': 1.1} * 1 (PACK): 2+ pending placement groups

My questions are:

  • What is the final line (PACK), and why is Ray asking for more GPU than I reserved for the two deployments?
  • Why, if I deploy only one model, is no extra GPU requested?

What can I do to solve this?

@RunLLM @seiji

The placement group issue also happens if I decrease the GPU fraction for each deployment.
It never gets deployed, because the demand goes over 1 GPU.

Node status
---------------------------------------------------------------
Active:
 1 headgroup
Idle:
 1 gpu-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 1.0/24.0 CPU
 0.0/1.0 GPU
 0B/37.31GiB memory
 0B/11.84GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 {'CPU': 2.0, 'GPU': 0.1}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 12.0, 'GPU': 0.7}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 2.0, 'GPU': 1.1} * 1 (PACK): 2+ pending placement groups

Nice, glad the first issue is solved! For future reference, ray status and serve status are different commands; the latter gives information about the in-progress serve app.

For the next issue: fractional GPU deployments are not yet supported in Ray Serve LLM. I suspect the strange resource requests are because the manually specified deployment/actor options are interacting with our default/happy path, which sets resource requests per deployment atomically based on TP/PP/DP size.
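
As a rough illustration of that happy path (a sketch only, not a tested drop-in fix): dropping the manual ray_actor_options and letting Serve LLM size each deployment itself would look roughly like the code below. Note that in this form each model is expected to get its own full GPU, so a single-GPU node can't host both replicas.

# Sketch: same two models without manual ray_actor_options, letting
# Ray Serve LLM derive per-deployment resources. Assumption: each model
# then gets a whole GPU, so this layout needs two GPUs in the cluster.
from ray.serve.llm import LLMConfig, build_openai_app

chat_llm = LLMConfig(
    model_loading_config={"model_id": "Qwen/Qwen3-32B"},
    engine_kwargs={
        "max_model_len": 10000,
        "dtype": "bfloat16",
        "gpu_memory_utilization": 0.90,
        "trust_remote_code": True,
    },
    accelerator_type="A100",
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 1},
    },
)

embedding_llm = LLMConfig(
    model_loading_config={"model_id": "Qwen/Qwen3-Embedding-0.6B"},
    engine_kwargs={
        "max_model_len": 1000,
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "task": "embed",
    },
    accelerator_type="A100",
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 1},
    },
)

llm_app = build_openai_app({"llm_configs": [chat_llm, embedding_llm]})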