Ray Serve LLM on CPU with KubeRay

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.54.0
  • Python version: 3.11
  • OS: WSL Ubuntu
  • Cloud/Infrastructure:
    • Kubernetes Minikube (KubeRay / RayService CRD)
    • CPU‑only cluster (no GPUs available)
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Ray Serve LLM (ray.serve.llm:build_openai_app) to run on KubeRay in a CPU‑only Minikube cluster
  • Actual: The application reports RUNNING, but replicas never finish initializing and no inference endpoints become reachable.

4. Description/Question

I’m trying to run Ray Serve LLM (ray.serve.llm) on a CPU‑only Kubernetes cluster (Minikube) for prototyping and experimentation, but I haven’t been able to get it fully working.

I deployed the KubeRay operator and a RayService on a CPU‑only Minikube cluster (no GPU resources at all). The cluster, the Serve controller, and the pods all start successfully, and the application status eventually shows as RUNNING. However:

  • No inference endpoints ever become reachable

  • No model‑related logs appear (no model loading, no vLLM engine logs, etc.)

  • Serve replicas remain stuck in an “initializing” state indefinitely

  • The Ray dashboard logs repeatedly warn that replicas are taking too long to initialize

  • There are no explicit crashes or errors indicating what is wrong

From the outside, the app appears to be running, but it never becomes usable.

What I tried

  1. Using the official Ray LLM stack

    • Started from the official rayproject/ray-llm images

    • Used ray.serve.llm.build_openai_app (minimal code sketch below)

    • Set use_cpu: true and ensured no GPU resources (num_gpus: 0)

    • Tested with a very small model (facebook/opt-125m)

  2. Creating a custom CPU‑only Docker image (Dockerfile sketch after this list)

    • Based on rayproject/ray:2.54.0-py311-cpu

    • Installed the CPU‑only vLLM wheel (vllm 0.15.1+cpu)

    • Installed required system dependencies (gcc, libnuma1, etc.)

    • Pinned NumPy to a compatible version

    • Verified the image builds and runs correctly
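
For reference, the custom image from attempt 2 looks roughly like this. It is a sketch under assumptions: the exact NumPy pin and the index URL hosting the CPU vLLM wheel are placeholders, since the real values depend on how the wheel was built or published.

```dockerfile
# Sketch of the custom CPU-only image; package names follow the steps above.
FROM rayproject/ray:2.54.0-py311-cpu

# System dependencies needed to run the CPU vLLM build.
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc g++ libnuma1 && \
    rm -rf /var/lib/apt/lists/*
USER ray

# Pin NumPy (exact bound is an assumption), then install the CPU-only
# vLLM build. The wheel index URL below is a hypothetical placeholder.
RUN pip install --no-cache-dir "numpy<2" && \
    pip install --no-cache-dir "vllm==0.15.1+cpu" \
        --extra-index-url https://example.com/cpu-wheels
```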

Despite this, the behavior is the same: Ray Serve LLM deployments appear to start but never finish initializing, and no endpoints are exposed.
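
For completeness, the application side of attempt 1 boils down to the pattern below. This is a minimal sketch, not verified on CPU: the model_id is an arbitrary serving name, and use_cpu is the flag discussed in this thread, so whether LLMConfig accepts it may depend on the Ray version.

```python
# Minimal sketch of the CPU-only Serve LLM app being deployed.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="opt-125m",               # arbitrary serving name
        model_source="facebook/opt-125m",  # small model used for testing
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Flag taken from this thread; availability may vary by Ray version.
    use_cpu=True,
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

When initialization succeeds, this should expose the OpenAI‑compatible routes (e.g. /v1/chat/completions) on the Serve HTTP port; in my case those routes never appear.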

Question

Is it actually possible to run Ray Serve LLM on a CPU‑only cluster with no GPU resources at all?

More specifically:

  • Is Ray Serve LLM expected to work on CPU without GPUs, or is GPU availability required/assumed in practice?

  • Is there any non‑vLLM CPU backend for Ray Serve LLM, or is vLLM mandatory?

I’m mainly trying to understand whether this is:

  • a misconfiguration on my side, or

  • a known limitation / unsupported use case of Ray Serve LLM today.

Ray Serve LLM can run on CPU-only clusters, but this path is experimental and not officially supported for production. You must install a CPU-only vLLM wheel (e.g., vllm 0.15.1+cpu) and set use_cpu: true in your LLMConfig. Even with these settings, some users report that deployments hang or never expose endpoints: vLLM and Ray Serve LLM are designed primarily for GPU inference, and some features may not work as expected on CPU-only clusters. There is no non-vLLM CPU backend for Ray Serve LLM; vLLM is currently mandatory for LLM serving through this API. See the discussion and Dockerfile examples in Ray issue #56636.
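
For a RayService deployment, the corresponding serveConfigV2 fragment would look roughly like the sketch below. Field names mirror the LLMConfig options above; use_cpu is the flag discussed here and may not be honored on every Ray version.

```yaml
serveConfigV2: |
  applications:
    - name: llm-app
      route_prefix: /
      import_path: ray.serve.llm:build_openai_app
      args:
        llm_configs:
          - model_loading_config:
              model_id: opt-125m
              model_source: facebook/opt-125m
            use_cpu: true
            deployment_config:
              autoscaling_config:
                min_replicas: 1
                max_replicas: 1
```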

If you need a fully supported, production-grade CPU-only LLM serving solution, Ray Serve LLM may not be suitable at this time. For prototyping, you can try the custom Dockerfile and configuration described in the linked issue, but expect possible limitations or initialization hangs. Would you like a step-by-step guide for setting up a CPU-only Ray Serve LLM deployment, or more detail on the current limitations?
