Ray Serve LLM on KubeRay with CPU

1. Severity of the issue:
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.54.0

  • Python version: 3.11

  • OS: WSL Ubuntu

  • Cloud/Infrastructure:

    • Kubernetes Minikube (KubeRay / RayService CRD)

    • CPU‑only cluster (no GPUs available)

  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

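  • Expected: Ray Serve LLM (ray.serve.llm:build_openai_app) runs on KubeRay on a CPU‑only Minikube cluster.

  • Actual: The application reports as running, but replicas never finish initializing and no inference endpoint becomes reachable.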

4. Description/Question

I’m trying to run Ray Serve LLM (ray.serve.llm) on a CPU‑only Kubernetes cluster (minikube) for experimental/prototyping reasons, but I haven’t been able to get it to fully work.

I deployed the KubeRay operator and a RayService on a CPU‑only minikube cluster (no GPU resources at all). The cluster, Serve controller, and pods start successfully, and the application status eventually shows as running. However:

  • No inference endpoints ever become reachable

  • No model‑related logs appear (no model loading, no vLLM engine logs, etc.)

  • Serve replicas remain stuck in an “initializing” state indefinitely

  • The Ray dashboard logs repeatedly warn that replicas are taking too long to initialize

  • There are no explicit crashes or errors indicating what is wrong

From the outside, the app appears to be running, but it never becomes usable.
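
To make the "looks running but is not usable" symptom concrete, it can help to inspect per-replica state instead of the top-level application status. The sketch below assumes the Serve REST API on the dashboard port (GET /api/serve/applications/, usually port 8265) returns JSON shaped like applications -> <app> -> deployments -> <deployment> -> replicas[].state. That shape, the helper name find_non_running_replicas, and the sample payload are assumptions for illustration and should be checked against the Ray version in use.

```python
import json
from typing import Any


def find_non_running_replicas(details: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (application, deployment, state) for each replica that is not RUNNING."""
    stuck = []
    for app_name, app in details.get("applications", {}).items():
        for dep_name, dep in app.get("deployments", {}).items():
            for replica in dep.get("replicas", []):
                state = replica.get("state", "UNKNOWN")
                if state != "RUNNING":
                    stuck.append((app_name, dep_name, state))
    return stuck


# Hypothetical payload for illustration; not captured from a real cluster.
# The application and deployment names are made up.
sample = json.loads("""
{
  "applications": {
    "llm_app": {
      "status": "RUNNING",
      "deployments": {
        "ingress": {"replicas": [{"state": "RUNNING"}]},
        "llm_server": {"replicas": [{"state": "STARTING"}]}
      }
    }
  }
}
""")

print(find_non_running_replicas(sample))
```

Against a live cluster, the same function could be fed the JSON fetched from the dashboard after port-forwarding the head service. A non-empty result while the application status reads as running would match the behavior described above.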

What I tried

  1. Using the official Ray LLM stack

    • Started from the official rayproject/ray-llm images

    • Used ray.serve.llm.build_openai_app

    • Set use_cpu: true and ensured no GPU resources (num_gpus: 0)

    • Tested with a very small model (facebook/opt-125m)

  2. Creating a custom CPU‑only Docker image

    • Based on rayproject/ray:2.54.0-py311-cpu

    • Installed the CPU‑only vLLM wheel (vllm 0.15.1+cpu)

    • Installed required system dependencies (gcc, libnuma1, etc.)

    • Pinned NumPy to a compatible version

    • Verified the image builds and runs correctly

Despite this, the behavior is the same: Ray Serve LLM deployments appear to start but never finish initializing, and no endpoints are exposed.
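
To rule out the image itself, one quick check inside the container is which vLLM build is actually installed. This is a minimal sketch that assumes the CPU wheel reports a "+cpu" local version label, as in the 0.15.1+cpu version string mentioned above. Other builds may label themselves differently, and the helper names are illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_cpu_build(version: Optional[str]) -> bool:
    """True if the version carries a '+cpu' local label, e.g. '0.15.1+cpu'."""
    if version is None:
        return False
    # Everything after '+' is the local version label (PEP 440).
    _, _, local = version.partition("+")
    return local.startswith("cpu")


found = installed_version("vllm")
print(f"vllm version: {found!r}, cpu build: {is_cpu_build(found)}")
```

Running this in the Ray worker pod (not only in a local docker run) checks that the Serve replicas see the same vLLM build as the one verified at image build time.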

Question

Is it actually possible to run Ray Serve LLM on a CPU‑only cluster with no GPU resources at all?

More specifically:

  • Is Ray Serve LLM expected to work on CPU without GPUs, or is GPU availability required/assumed in practice?

  • Is there any non‑vLLM CPU backend for Ray Serve LLM, or is vLLM mandatory?

I’m mainly trying to understand whether this is:

  • a misconfiguration on my side, or

  • a known limitation / unsupported use case of Ray Serve LLM today.

Ray Serve LLM (ray.serve.llm:build_openai_app) is designed primarily for GPU-backed inference and, as of now, does not officially support CPU-only clusters for vLLM-based deployments. While there is ongoing interest and some experimental workarounds for CPU-only support (such as building custom Docker images with CPU-only vLLM), these are not yet fully supported or documented as production-ready flows. The official Ray LLM stack expects GPU resources, and vLLM itself is optimized for GPU inference; thus, deployments on CPU-only clusters typically hang during initialization without clear errors, as you observed. See Ray GitHub Issue #56636 and Ray GitHub Issue #53603.
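
One way a silent hang like this can arise is a resource request that the cluster can never satisfy. Whether that is what happens in this specific setup is not established here. The sketch below only illustrates the comparison, with hypothetical numbers and a hypothetical helper name.

```python
def unsatisfiable(
    request: dict[str, float], available: dict[str, float]
) -> dict[str, tuple[float, float]]:
    """Map each resource whose request exceeds availability to (requested, available)."""
    return {
        name: (amount, available.get(name, 0.0))
        for name, amount in request.items()
        if amount > available.get(name, 0.0)
    }


# Hypothetical numbers: a CPU-only node and a replica that asks for one GPU.
cluster = {"CPU": 8.0, "memory": 16e9}
replica_request = {"CPU": 1.0, "GPU": 1.0}

print(unsatisfiable(replica_request, cluster))
```

On a live cluster, the dict returned by ray.cluster_resources() could be passed as the available argument. The replica's actual resource request would need to be read from the deployment configuration or the dashboard.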

There is currently no alternative non-vLLM CPU backend for Ray Serve LLM; vLLM is the only supported engine, and it is GPU-centric. If you need CPU-only LLM inference, you may need to use other frameworks or wait for official support in Ray Serve LLM. Your experience is a known limitation, not a misconfiguration. Would you like more detail on possible workarounds or tracking progress for CPU support?
