Ray Serve LLM on CPU with KubeRay

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.54.0
  • Python version: 3.11
  • OS: WSL Ubuntu
  • Cloud/Infrastructure:
    • Kubernetes Minikube (KubeRay / RayService CRD)
    • CPU‑only cluster (no GPUs available)
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Ray Serve LLM (ray.serve.llm:build_openai_app) to run on KubeRay in a CPU‑only Minikube cluster
  • Actual: The application reports RUNNING, but replicas never finish initializing and no inference endpoints become reachable.

4. Description/Question

I’m trying to run Ray Serve LLM (ray.serve.llm) on a CPU‑only Kubernetes cluster (Minikube) for prototyping and experimentation, but I haven’t been able to get it fully working.

I deployed the KubeRay operator and a RayService on a CPU‑only Minikube cluster (no GPU resources at all). The cluster, the Serve controller, and the pods all start successfully, and the application status eventually shows as RUNNING. However:

  • No inference endpoints ever become reachable

  • No model‑related logs appear (no model loading, no vLLM engine logs, etc.)

  • Serve replicas remain stuck in an “initializing” state indefinitely

  • The Ray dashboard logs repeatedly warn that replicas are taking too long to initialize

  • There are no explicit crashes or errors indicating what is wrong

From the outside, the app appears to be running, but it never becomes usable.

What I tried

  1. Using the official Ray LLM stack

    • Started from the official rayproject/ray-llm images

    • Used ray.serve.llm.build_openai_app (minimal code sketch below)

    • Set use_cpu: true and ensured no GPU resources (num_gpus: 0)

    • Tested with a very small model (facebook/opt-125m)

  2. Creating a custom CPU‑only Docker image (Dockerfile sketch after this list)

    • Based on rayproject/ray:2.54.0-py311-cpu

    • Installed the CPU‑only vLLM wheel (vllm 0.15.1+cpu)

    • Installed required system dependencies (gcc, libnuma1, etc.)

    • Pinned NumPy to a compatible version

    • Verified the image builds and runs correctly
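
For reference, the custom image from attempt 2 looks roughly like this. It is a sketch under assumptions: the exact NumPy pin and the index URL hosting the CPU vLLM wheel are placeholders, since the real values depend on how the wheel was built or published.

```dockerfile
# Sketch of the custom CPU-only image; package names follow the steps above.
FROM rayproject/ray:2.54.0-py311-cpu

# System dependencies needed to run the CPU vLLM build.
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc g++ libnuma1 && \
    rm -rf /var/lib/apt/lists/*
USER ray

# Pin NumPy (exact bound is an assumption), then install the CPU-only
# vLLM build. The wheel index URL below is a hypothetical placeholder.
RUN pip install --no-cache-dir "numpy<2" && \
    pip install --no-cache-dir "vllm==0.15.1+cpu" \
        --extra-index-url https://example.com/cpu-wheels
```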

Despite this, the behavior is the same: Ray Serve LLM deployments appear to start but never finish initializing, and no endpoints are exposed.
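
For completeness, the application side of attempt 1 boils down to the pattern below. This is a minimal sketch, not verified on CPU: the model_id is an arbitrary serving name, and use_cpu is the flag discussed in this thread, so whether LLMConfig accepts it may depend on the Ray version.

```python
# Minimal sketch of the CPU-only Serve LLM app being deployed.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="opt-125m",               # arbitrary serving name
        model_source="facebook/opt-125m",  # small model used for testing
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Flag taken from this thread; availability may vary by Ray version.
    use_cpu=True,
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

When initialization succeeds, this should expose the OpenAI‑compatible routes (e.g. /v1/chat/completions) on the Serve HTTP port; in my case those routes never appear.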

Question

Is it actually possible to run Ray Serve LLM on a CPU‑only cluster with no GPU resources at all?

More specifically:

  • Is Ray Serve LLM expected to work on CPU without GPUs, or is GPU availability required/assumed in practice?

  • Is there any non‑vLLM CPU backend for Ray Serve LLM, or is vLLM mandatory?

I’m mainly trying to understand whether this is:

  • a misconfiguration on my side, or

  • a known limitation / unsupported use case of Ray Serve LLM today.

Ray Serve LLM can run on CPU-only clusters, but this path is experimental and not officially supported for production. You must install a CPU-only vLLM wheel (e.g., vllm 0.15.1+cpu) and set use_cpu: true in your LLMConfig. Even with these settings, some users report that deployments hang or never expose endpoints: vLLM and Ray Serve LLM are designed primarily for GPU inference, and some features may not work as expected on CPU-only clusters. There is no non-vLLM CPU backend for Ray Serve LLM; vLLM is currently mandatory for LLM serving through this API. See the discussion and Dockerfile examples in Ray issue #56636.
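
For a RayService deployment, the corresponding serveConfigV2 fragment would look roughly like the sketch below. Field names mirror the LLMConfig options above; use_cpu is the flag discussed here and may not be honored on every Ray version.

```yaml
serveConfigV2: |
  applications:
    - name: llm-app
      route_prefix: /
      import_path: ray.serve.llm:build_openai_app
      args:
        llm_configs:
          - model_loading_config:
              model_id: opt-125m
              model_source: facebook/opt-125m
            use_cpu: true
            deployment_config:
              autoscaling_config:
                min_replicas: 1
                max_replicas: 1
```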

If you need a fully supported, production-grade CPU-only LLM serving solution, Ray Serve LLM may not be suitable at this time. For prototyping, you can try the custom Dockerfile and configuration described in the linked issue, but expect possible limitations or initialization hangs. Would you like a step-by-step guide for setting up a CPU-only Ray Serve LLM deployment, or more detail on the current limitations?
