1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.54.0
- Python version: 3.11
- OS: WSL Ubuntu
- Cloud/Infrastructure:
- Kubernetes Minikube (KubeRay / RayService CRD)
- CPU‑only cluster (no GPUs available)
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: Ray Serve LLM (`ray.serve.llm:build_openai_app`) to run on KubeRay on a CPU-only Minikube cluster.
- Actual: Unable to run it — the application reports as running, but replicas never finish initializing and no inference endpoint ever becomes usable.
4. Description/Question
I’m trying to run Ray Serve LLM (`ray.serve.llm`) on a CPU-only Kubernetes cluster (Minikube) for experimentation/prototyping, but I haven’t been able to get it fully working.
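For context, the application is built with the standard builder. Below is a simplified sketch of the Python equivalent of my serve config; the model id, replica counts, and engine settings are illustrative of my setup rather than an exact reproduction, and I’ve left out the CPU-specific knobs I experimented with:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Tiny model so the test needs little memory; no accelerator_type is set
# because the cluster has no GPUs at all.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="opt-125m",
        model_source="facebook/opt-125m",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Illustrative vLLM engine settings for a CPU run.
    engine_kwargs=dict(enforce_eager=True),
)

app = build_openai_app({"llm_configs": [llm_config]})

# Running locally; on KubeRay the same builder is referenced from the
# RayService serve config (serveConfigV2) via its import path instead.
serve.run(app, blocking=True)
```

In the RayService I point `serveConfigV2` at the `ray.serve.llm:build_openai_app` import path rather than calling `serve.run` myself.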
I deployed the KubeRay operator and a RayService on a CPU-only Minikube cluster (no GPU resources at all). The cluster, the Serve controller, and the pods all start successfully, and the application status eventually shows as running. However:
- No inference endpoints ever become reachable
- No model-related logs appear (no model loading, no vLLM engine logs, etc.)
- Serve replicas remain stuck in an “initializing” state indefinitely
- The Ray dashboard logs repeatedly warn that replicas are taking too long to initialize
- There are no explicit crashes or errors indicating what is wrong
From the outside, the app appears to be running, but it never becomes usable.
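(By “reachable” I mean that even a trivial probe of the OpenAI-compatible route never answers. The snippet below is roughly what I use; the localhost:8000 URL assumes the Serve HTTP port is port-forwarded out of the cluster, so it is specific to my setup.)

```python
# Trivial probe of the OpenAI-compatible API that build_openai_app exposes.
# Assumes `kubectl port-forward` of the Serve HTTP port to localhost:8000.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
print(resp.status_code, resp.text)
```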
What I tried

- Using the official Ray LLM stack
  - Started from the official `rayproject/ray-llm` images
  - Used `ray.serve.llm.build_openai_app` (as sketched above)
  - Set `use_cpu: true` and ensured no GPU resources (`num_gpus: 0`)
  - Tested with a very small model (`facebook/opt-125m`)
- Creating a custom CPU-only Docker image
  - Based on `rayproject/ray:2.54.0-py311-cpu`
  - Installed the CPU-only vLLM wheel (`vllm 0.15.1+cpu`)
  - Installed required system dependencies (gcc, libnuma1, etc.)
  - Pinned NumPy to a compatible version
  - Verified the image builds and runs correctly (see the smoke test sketched below)

Despite this, the behavior is the same: Ray Serve LLM deployments appear to start but never finish initializing, and no endpoints are exposed.
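“Runs correctly” for the custom image means, roughly, that a standalone (non-Serve) vLLM smoke test along these lines works. This is a simplified sketch and assumes the CPU wheel exposes the usual offline `vllm.LLM` API:

```python
# Standalone smoke test of the CPU vLLM wheel inside the custom image,
# bypassing Ray Serve entirely. Assumes the usual offline vllm.LLM API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```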
Question
Is it actually possible to run Ray Serve LLM on a CPU‑only cluster with no GPU resources at all?
More specifically:
- Is Ray Serve LLM expected to work on CPU without GPUs, or is GPU availability required/assumed in practice?
- Is there any non-vLLM CPU backend for Ray Serve LLM, or is vLLM mandatory?
I’m mainly trying to understand whether this is:
- a misconfiguration on my side, or
- a known limitation / unsupported use case of Ray Serve LLM today.