Starting Ray on RKE2 Does Not Work

Hi, I'm currently trying to get Ray running on RKE2.

To do so, I executed the following steps:

helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0
kubectl get pods
# wait until kuberay-operator-5c7f84f8bc-4cctv is running
helm install raycluster kuberay/ray-cluster --version 1.3.0
kubectl get rayclusters
kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
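(For completeness: this assumes the KubeRay Helm repo had already been added beforehand, roughly like this:)

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update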

Unfortunately, the pods are in CrashLoopBackOff:

NAME                                          READY   STATUS             RESTARTS       AGE
raycluster-kuberay-head-ln9pc                 0/1     CrashLoopBackOff   6 (3m2s ago)   10m
raycluster-kuberay-workergroup-worker-fns67   0/1     Running            4 (2m1s ago)   10m

The head pod's log says:

kubectl logs -f raycluster-kuberay-head-ln9pc
[2025-03-26 02:16:11,872 W 1 1] global_state_accessor.cc:463: Retrying to get node with node ID 21251e62483a5baeb5b73052e95c778edae610ae4c0a174318af68fe
2025-03-26 02:16:08,555 INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster. See DOCS for more details.
2025-03-26 02:16:08,555 INFO scripts.py:865 -- Local node IP: 10.42.1.101
2025-03-26 02:16:12,875 SUCC scripts.py:902 -- --------------------
2025-03-26 02:16:12,875 SUCC scripts.py:903 -- Ray runtime started.
2025-03-26 02:16:12,875 SUCC scripts.py:904 -- --------------------
2025-03-26 02:16:12,875 INFO scripts.py:906 -- Next steps
2025-03-26 02:16:12,876 INFO scripts.py:909 -- To add another node to this Ray cluster, run
2025-03-26 02:16:12,876 INFO scripts.py:912 -- ray start --address='10.42.1.101:6379'
2025-03-26 02:16:12,876 INFO scripts.py:921 -- To connect to this Ray cluster:
2025-03-26 02:16:12,876 INFO scripts.py:923 -- import ray
2025-03-26 02:16:12,876 INFO scripts.py:924 -- ray.init()
2025-03-26 02:16:12,876 INFO scripts.py:936 -- To submit a Ray job using the Ray Jobs CLI:
2025-03-26 02:16:12,876 INFO scripts.py:937 -- RAY_ADDRESS='XXX:8265' ray job submit --working-dir . -- python my_script.py
2025-03-26 02:16:12,876 INFO scripts.py:946 -- See DOCS
2025-03-26 02:16:12,876 INFO scripts.py:950 -- for more information on submitting Ray jobs to the Ray cluster.
2025-03-26 02:16:12,876 INFO scripts.py:955 -- To terminate the Ray runtime, run
2025-03-26 02:16:12,876 INFO scripts.py:956 -- ray stop
2025-03-26 02:16:12,876 INFO scripts.py:959 -- To view the status of the cluster, use
2025-03-26 02:16:12,876 INFO scripts.py:960 -- ray status
2025-03-26 02:16:12,876 INFO scripts.py:964 -- To monitor and debug Ray, view the dashboard at
2025-03-26 02:16:12,876 INFO scripts.py:965 -- 10.42.1.101:8265
2025-03-26 02:16:12,876 INFO scripts.py:972 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2025-03-26 02:16:12,877 INFO scripts.py:1076 -- --block
2025-03-26 02:16:12,877 INFO scripts.py:1077 -- This command will now block forever until terminated by a signal.
2025-03-26 02:16:12,877 INFO scripts.py:1080 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
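In case it helps with diagnosing this, I can also share the output of the following (pod name taken from above):

kubectl logs raycluster-kuberay-head-ln9pc --previous   # log of the previous, crashed attempt
kubectl get events --sort-by=.lastTimestamp             # cluster events in chronological order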

I ran the same setup twice, but I don't see any reason why Ray is failing here. I'm using Kubernetes v1.31.5+rke2r1 and Helm 3.16.3, so this should fulfill the requirements.
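For reference, those versions are what the following report:

kubectl version        # server version: v1.31.5+rke2r1
helm version --short   # v3.16.3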

The output of kubectl describe pod contains:

Containers:
  ray-head:
    Container ID:  containerd://94d1623e11765f3b0641fde25a25faf35770f39ecc0036c6d67010e851f6bf3d
    Image:         rayproject/ray:2.41.0
    Image ID:      docker.io/rayproject/ray@sha256:d57d976a11ba6ef2e2eb6184e8c5523db4b47d1b86b1036d255359824c8d40a0
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
    Args:
      ulimit -n 65536; ray start --head --block --dashboard-agent-listen-port=52365 --dashboard-host=0.0.0.0 --memory=2000000000 --metrics-export-port=8080 --num-cpus=1

and

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  13m                   default-scheduler  Successfully assigned default/raycluster-kuberay-head-ln9pc to clara29.sc.uni-leipzig.de
  Normal   Pulled     10m (x5 over 13m)     kubelet            Container image "rayproject/ray:2.41.0" already present on machine
  Normal   Created    10m (x5 over 13m)     kubelet            Created container ray-head
  Normal   Started    10m (x5 over 13m)     kubelet            Started container ray-head
  Warning  BackOff    2m55s (x48 over 12m)  kubelet            Back-off restarting failed container ray-head in pod raycluster-kuberay-head-ln9pc_default(d536eb85-3310-4f62-8ed9-699f634253e5)

Might the resource limits be a problem?
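To check this, here is roughly how I'd inspect what the chart actually set on the head container (a jsonpath sketch; pod name from above):

kubectl get pod raycluster-kuberay-head-ln9pc -o jsonpath='{.spec.containers[0].resources}'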

My goal is to run vLLM on multiple nodes on RKE2 (in the future using the NVIDIA GPU Operator). I would appreciate any hints on how to get Ray started correctly.
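For context, once the cluster comes up healthy, my rough plan for a first smoke test would be something like this (a sketch; it assumes the chart's default head service name raycluster-kuberay-head-svc and the dashboard port 8265 from the logs above):

kubectl port-forward svc/raycluster-kuberay-head-svc 8265:8265 &
RAY_ADDRESS=http://localhost:8265 ray job submit -- python -c "import ray; ray.init(); print(ray.cluster_resources())"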

EDIT: The links in the logs were partially removed; otherwise the forum software wouldn't have accepted the post.