I followed the docs and started my cluster with the default YAML configuration (ray-operator/config/samples/ray-cluster.complete.yaml at master in ray-project/kuberay), but the worker pods never finish initializing. They always report: container “ray-worker” in pod “raycluster-complete-large-group-worker-nb2zq” is waiting to start: PodInitializing.
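Here is the full description of the stuck worker pod, obtained with kubectl describe (the pod name and namespace are the ones shown in the output below):

kubectl describe pod raycluster-complete-large-group-worker-nb2zq -n bt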
Name: raycluster-complete-large-group-worker-nb2zq
Namespace: bt
Priority: 0
Node: node20/10.1.0.29
Start Time: Thu, 07 Nov 2024 15:32:26 +0800
Labels: app.kubernetes.io/created-by=kuberay-operator
app.kubernetes.io/name=kuberay
ray.io/cluster=raycluster-complete
ray.io/group=large-group
ray.io/identifier=raycluster-complete-worker
ray.io/is-ray-node=yes
ray.io/node-type=worker
Annotations: cni.projectcalico.org/containerID: fb3b176adb39869a085a82445a401f4493224187d16e433b50aa87c4590bd0d9
cni.projectcalico.org/podIP: 10.233.65.80/32
cni.projectcalico.org/podIPs: 10.233.65.80/32
k8s.v1.cni.cncf.io/network-status:
[{
"name": "k8s-pod-network",
"ips": [
"10.233.65.80"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "k8s-pod-network",
"ips": [
"10.233.65.80"
],
"default": true,
"dns": {}
}]
ray.io/ft-enabled: false
Status: Pending
IP: 10.233.65.80
IPs:
IP: 10.233.65.80
Controlled By: RayCluster/raycluster-complete
Init Containers:
wait-gcs-ready:
Container ID: docker://09bdda9966e5346c121401a67f0a059fd8f7ab95e13a846b48211d2a0357aba0
Image: myimage
Image ID: docker-pullable://myimage@sha256:abef1b5ef98b8d872f179e6be766f131201f575465db655a622b643ea720fd3a
Port: <none>
Host Port: <none>
Command:
/bin/bash
-lc
--
Args:
SECONDS=0
while true; do
  if (( SECONDS <= 120 )); then
    if ray health-check --address raycluster-complete-head-svc.bt.svc.cluster.local:6379 > /dev/null 2>&1; then
      echo "GCS is ready."
      break
    fi
    echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
  else
    if ray health-check --address raycluster-complete-head-svc.bt.svc.cluster.local:6379; then
      echo "GCS is ready. Any error messages above can be safely ignored."
      break
    fi
    echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md."
  fi
  sleep 5
done
State: Running
Started: Thu, 07 Nov 2024 15:32:30 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 200m
memory: 256Mi
Requests:
cpu: 200m
memory: 256Mi
Environment:
FQ_RAY_IP: raycluster-complete-head-svc.bt.svc.cluster.local
RAY_IP: raycluster-complete-head-svc
Mounts:
/fs/nlp/bt from ray-logs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b2hkx (ro)
Containers:
ray-worker:
Container ID:
Image: myimage
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/bash
-lc
--
Args:
ulimit -n 65536; ray start --address=raycluster-complete-head-svc.bt.svc.cluster.local:6379 --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365 --num-cpus=10 --memory=21474836480
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 10
memory: 20Gi
Requests:
cpu: 10
memory: 20Gi
Liveness: exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=2s period=5s #success=1 #failure=120
Readiness: exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=10s timeout=2s period=5s #success=1 #failure=10
Environment:
FQ_RAY_IP: raycluster-complete-head-svc.bt.svc.cluster.local
RAY_IP: raycluster-complete-head-svc
RAY_CLUSTER_NAME: (v1:metadata.labels['ray.io/cluster'])
RAY_CLOUD_INSTANCE_ID: raycluster-complete-large-group-worker-nb2zq (v1:metadata.name)
RAY_NODE_TYPE_NAME: (v1:metadata.labels['ray.io/group'])
KUBERAY_GEN_RAY_START_CMD: ray start --address=raycluster-complete-head-svc.bt.svc.cluster.local:6379 --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365 --num-cpus=10 --memory=21474836480
RAY_PORT: 6379
RAY_ADDRESS: raycluster-complete-head-svc.bt.svc.cluster.local:6379
RAY_USAGE_STATS_KUBERAY_IN_USE: 1
REDIS_PASSWORD:
RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE: 1
Mounts:
/dev/shm from shared-mem (rw)
/fs/nlp/bt from ray-logs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b2hkx (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
ray-logs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
shared-mem:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 20Gi
kube-api-access-b2hkx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
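Is this the right way to debug the stuck init container? The only things I can think to check are the wait-gcs-ready logs and whether GCS on the head is reachable at all, roughly like this (the head pod name is a placeholder, and the label selector is my assumption based on the ray.io/node-type label above):

# Logs of the init container that keeps the pod in PodInitializing
kubectl logs raycluster-complete-large-group-worker-nb2zq -n bt -c wait-gcs-ready

# Find the head pod, then run the same health check the init container runs
kubectl get pods -n bt -l ray.io/node-type=head
kubectl exec -n bt <head-pod-name> -- ray health-check --address raycluster-complete-head-svc.bt.svc.cluster.local:6379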