Ray-worker pod is waiting to start

I followed the docs to start my cluster with the default YAML configuration (kuberay/ray-operator/config/samples/ray-cluster.complete.yaml at master in ray-project/kuberay). The worker pods always show: container "ray-worker" in pod "raycluster-complete-large-group-worker-nb2zq" is waiting to start: PodInitializing.
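For reference, I created the cluster by applying that sample manifest, roughly like this (the raw URL is inferred from the repo path above; the bt namespace matches the pod description below):

    kubectl apply -n bt -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.complete.yaml

kubectl describe pod on the stuck worker shows: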

Name:         raycluster-complete-large-group-worker-nb2zq
Namespace:    bt
Priority:     0
Node:         node20/10.1.0.29
Start Time:   Thu, 07 Nov 2024 15:32:26 +0800
Labels:       app.kubernetes.io/created-by=kuberay-operator
              app.kubernetes.io/name=kuberay
              ray.io/cluster=raycluster-complete
              ray.io/group=large-group
              ray.io/identifier=raycluster-complete-worker
              ray.io/is-ray-node=yes
              ray.io/node-type=worker
Annotations:  cni.projectcalico.org/containerID: fb3b176adb39869a085a82445a401f4493224187d16e433b50aa87c4590bd0d9
              cni.projectcalico.org/podIP: 10.233.65.80/32
              cni.projectcalico.org/podIPs: 10.233.65.80/32
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "k8s-pod-network",
                    "ips": [
                        "10.233.65.80"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "k8s-pod-network",
                    "ips": [
                        "10.233.65.80"
                    ],
                    "default": true,
                    "dns": {}
                }]
              ray.io/ft-enabled: false
Status:       Pending
IP:           10.233.65.80
IPs:
  IP:           10.233.65.80
Controlled By:  RayCluster/raycluster-complete
Init Containers:
  wait-gcs-ready:
    Container ID:  docker://09bdda9966e5346c121401a67f0a059fd8f7ab95e13a846b48211d2a0357aba0
    Image:         myimage
    Image ID:      docker-pullable://myimage@sha256:abef1b5ef98b8d872f179e6be766f131201f575465db655a622b643ea720fd3a
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -lc
      --
    Args:
      
                            SECONDS=0
                            while true; do
                              if (( SECONDS <= 120 )); then
                                if ray health-check --address raycluster-complete-head-svc.bt.svc.cluster.local:6379 > /dev/null 2>&1; then
                                  echo "GCS is ready."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
                              else
                                if ray health-check --address raycluster-complete-head-svc.bt.svc.cluster.local:6379; then
                                  echo "GCS is ready. Any error messages above can be safely ignored."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md."
                              fi
                              sleep 5
                            done
                          
    State:          Running
      Started:      Thu, 07 Nov 2024 15:32:30 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
    Environment:
      FQ_RAY_IP:  raycluster-complete-head-svc.bt.svc.cluster.local
      RAY_IP:     raycluster-complete-head-svc
    Mounts:
/fs/nlp/bt from ray-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b2hkx (ro)
Containers:
  ray-worker:
    Container ID:  
    Image:         myimage
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start  --address=raycluster-complete-head-svc.bt.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=10  --memory=21474836480 
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  20Gi
    Requests:
      cpu:      10
      memory:   20Gi
    Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=2s period=5s #success=1 #failure=120
    Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=10s timeout=2s period=5s #success=1 #failure=10
    Environment:
      FQ_RAY_IP:                            raycluster-complete-head-svc.bt.svc.cluster.local
      RAY_IP:                               raycluster-complete-head-svc
      RAY_CLUSTER_NAME:                      (v1:metadata.labels['ray.io/cluster'])
      RAY_CLOUD_INSTANCE_ID:                raycluster-complete-large-group-worker-nb2zq (v1:metadata.name)
      RAY_NODE_TYPE_NAME:                    (v1:metadata.labels['ray.io/group'])
      KUBERAY_GEN_RAY_START_CMD:            ray start  --address=raycluster-complete-head-svc.bt.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=10  --memory=21474836480 
      RAY_PORT:                             6379
      RAY_ADDRESS:                          raycluster-complete-head-svc.bt.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:       1
      REDIS_PASSWORD:                       
      RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE:  1
    Mounts:
      /dev/shm from shared-mem (rw)
/fs/nlp/bt from ray-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b2hkx (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  ray-logs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  20Gi
  kube-api-access-b2hkx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Well, why is your init container image "myimage"? Could you try the original config first to see whether there is an error? Maybe something went wrong in your custom image.
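To see what the init container is actually waiting on, you can also check its logs (pod name and namespace taken from your describe output above):

    kubectl logs raycluster-complete-large-group-worker-nb2zq -n bt -c wait-gcs-ready

The wait-gcs-ready script shown in the describe output loops on ray health-check, so if the ray CLI in your custom image is broken or missing, that loop will never succeed and the main container stays in PodInitializing.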

Thanks for your reply. Since this is my first time using Ray on Kubernetes, I'd like to ask: does the image here have to be an image like rayproject/ray:2.38.0, or can it be a custom image for my project? Ray 2.38.0 is installed in my custom image.

Unless there’s a special reason, it’s recommended to use the official image.
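For example, in ray-cluster.complete.yaml you would point both the head and worker containers at the official image. A minimal sketch of the relevant fields (layout follows the sample; the 2.38.0 tag is taken from your message, not from the sample itself):

    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.38.0
    workerGroupSpecs:
      - groupName: large-group
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.38.0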

Thank you for your suggestion. The worker now starts. But why does kubectl not show the worker pod when I want to use GPUs and replace

workerGroupSpecs:
  # ... (template.spec.containers[] for the worker)
          resources:
            limits:
              cpu: 10
              memory: 20Gi
            requests:
              cpu: 10
              memory: 20Gi

with

workerGroupSpecs:
  # ... (template.spec.containers[] for the worker)
          resources:
            limits:
              nvidia.com/gpu: 8
              memory: 500Gi
            requests:
              nvidia.com/gpu: 8
              memory: 500Gi

Please see the official Kubernetes documentation on scheduling GPUs.
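A pod that requests GPUs no node can provide normally sits in Pending rather than disappearing, so two quick checks help narrow this down (standard kubectl; the namespace and cluster label are taken from the describe output earlier in the thread):

    # Is a worker pod created at all, and in what state?
    kubectl get pods -n bt -l ray.io/cluster=raycluster-complete

    # Do the nodes advertise allocatable GPUs? This requires the NVIDIA device plugin.
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

If no worker pod exists at all, the KubeRay operator logs should explain why it could not create one.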