Ray Worker pod stuck at init stage and unable to be created

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Version

  • I am currently using Ray version 2.4.0
  • I am using my own custom Ray worker config file:
image:
  repository: registry.gitlab.com/marl3/images/raycluster
  tag: 1.2.1
  pullPolicy: IfNotPresent

nameOverride: kuberay
fullnameOverride: ""

imagePullSecrets:
  - name: gitlab-regcred

head:
  # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
  # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
  # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
  enableInTreeAutoscaling: true
  # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
  # The example configuration shown below represents the DEFAULT values.
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    securityContext: {}
    env: []
    envFrom: []
    # resources specifies optional resource request and limit overrides for the autoscaler container.
    # For large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  rayStartParams:
    dashboard-host: "0.0.0.0"
    block: 'true'
  containerEnv:
  - name: RAY_GRAFANA_HOST
    value: "http://prometheus-grafana.prometheus-system:80"
  - name: RAY_GRAFANA_IFRAME_HOST
    value: "http://grafana.127.0.0.1.nip.io:3000"
  - name: RAY_PROMETHEUS_HOST
    value: "http://prometheus-kube-prometheus-prometheus.prometheus-system:9090"
  envFrom: []
  resources:
    limits:
      cpu: "5"
      memory: "4G"
      nvidia.com/gpu: 0
    requests:
      cpu: "5"
      memory: "4G"
      nvidia.com/gpu: 0
  annotations: {}
  nodeSelector: {} 
  #tolerations : []  
  # tolerations to allow Ray head and worker pods to be scheduled on control-plane nodes
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
  affinity: {}
  securityContext: {}
  volumes:
    - name: log-volume
      emptyDir: {}
    - name: promtail-config
      configMap:
        name: promtail-config
  # Ray writes logs to /tmp/ray/session_latest/logs
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  # container command for head Pod.
  command: []
  args: []

worker:
  replicas: 1
  minReplicas: 1
  maxReplicas: 1
  initContainerImage: "busybox:1.28"
  resources:
    limits:
      cpu: "20"
      memory: "100G"
      nvidia.com/gpu: 0
    requests:
      cpu: "20"
      memory: "100G"
      nvidia.com/gpu: 0

service:
  type: ClusterIP

Description of problem
While using my own custom Ray worker config file, the Ray worker pod gets stuck at the init stage and is never created successfully.

Running kubectl describe pod shows that the pod is waiting for GCS and timed out. However, when troubleshooting the connection to GCS, I am able to reach the GCS server after I shell into the pod, which suggests that the GCS server is up and running.
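For reference, the reachability check I ran was roughly along these lines; the head service name below is the one KubeRay generated for my release, so adjust it to whatever kubectl get svc shows in your cluster:

  # from a shell inside the pod, confirm the GCS port (6379) on the head service answers
  python -c "import socket; socket.create_connection(('raycluster-kuberay-head-svc', 6379), timeout=5); print('GCS reachable')"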

Any idea what may be the problem? Thank you.

Are you sure your Ray head pod is running without issues? Its logs may have some useful insights. Allowing it to be scheduled on a control-plane node can cause all sorts of issues.

Besides, it doesn’t look like a connection error. The worker seems to be timing out while waiting for GCS to be ready. As a troubleshooting step, it may be worth connecting to the head node to get the cluster status, or submitting a simple job.
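For example, something along these lines (use whatever head pod name kubectl get pods shows for your release):

  # check overall cluster state from inside the head pod
  kubectl exec -it <ray-head-pod> -- ray status
  # run a trivial workload directly on the head node
  kubectl exec -it <ray-head-pod> -- \
    python -c "import ray; ray.init(); print(ray.cluster_resources())"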

Can you also move up to a newer version? Ray 2.4 is quite old. Are you on GCP? (I think Google might ship this as a default…)

Hi lobanov, thanks for the prompt reply. I was investigating my Ray cluster again and this is what I found.

When I ran kubectl get pods, the raycluster-kuberay-head pod status showed Running. However, when I checked the events with kubectl describe pod, some issues were raised inside the pod for the Flannel network plugin that I am using.

Warning  FailedCreatePodSandBox  8m                       kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "80d5b6227bee7b73d6b810ddd1812b2473f2271848c3b276a1f440eea5216809": plugin type="flannel" failed (add): loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory

/run/flannel/subnet.env does exist when I check from the node's terminal, with the following contents:

FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

However, one thing that I noticed is that Flannel is not running as any of the following:

  1. A systemd service: systemctl list-units --type=service | grep -i flannel
  2. A Docker container: sudo docker ps | grep flannel
  3. A Kubernetes DaemonSet: kubectl get daemonset -n kube-system

I ran the commands above and got no output, and there is no service with flannel in its name.

Do you think that this could be the issue instead? I installed Flannel with the following command:
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

Hey Sam,

I forgot to mention it earlier, but I am actually on a bare-metal Kubernetes cluster, not on GCP. I am using Linux, which unfortunately does not have minikube.

I am stuck on Ray 2.4 at the moment, as there are other scripts that run on Ray 2.4 which I wish to run on the Ray cluster later. Upgrading the dependencies might lead to other breaking changes that I want to avoid for now.

Hi @ryanquek22,

You probably don’t see the flannel daemonset in the kube-system namespace because the manifest you are using installs it into the kube-flannel namespace. Generally speaking, an incorrectly configured Flannel could cause problems, because it creates the overlay network for node-to-node communication. However, in that case you would see problems with any node-to-node traffic, not just Ray. To isolate this, you can try running an nginx pod and see if you can reach it from a pod running on a different node. If you can, then your overlay network is either working fine or not affecting the traffic.
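A rough sketch of that isolation test (pod names and images are just examples; make sure the two pods land on different nodes, e.g. with nodeName or a nodeSelector):

  # check the flannel daemonset in its own namespace
  kubectl get daemonset -n kube-flannel
  # start an nginx pod and note its pod IP and node
  kubectl run net-test-nginx --image=nginx --restart=Never
  kubectl get pod net-test-nginx -o wide
  # from a pod scheduled on a different node, fetch the nginx welcome page by pod IP
  kubectl run net-test-client --rm -it --restart=Never --image=busybox:1.28 -- \
    wget -qO- http://<nginx-pod-ip>

If the welcome page comes back across nodes, the overlay network is not your problem.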

As I said earlier, you first want to establish whether the issue is with node/pod connectivity or whether the Ray head itself is unhealthy and unable to process requests. Check the Ray head logs, not just the Kubernetes events; there may be some insightful messages there. Then try submitting a trivial job directly to the head node and connecting to the Ray dashboard to see how it gets executed. If all of that works, then you can focus on worker connectivity.
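For the job test, something like this should be enough (the service name is assumed from your chart values, and it requires the Ray 2.4 CLI on your machine):

  # forward the dashboard/job-submission port to your machine
  kubectl port-forward svc/raycluster-kuberay-head-svc 8265:8265
  # in another terminal, submit a trivial job via the Ray Jobs API
  RAY_ADDRESS=http://127.0.0.1:8265 ray job submit -- \
    python -c "import ray; ray.init(); print(ray.cluster_resources())"

If that runs and shows up in the dashboard, the head itself is healthy.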

Hi @lobanov

Thanks for providing a detailed breakdown of steps to try.

From the Ray head, the logs below do not suggest any issue.

2024-08-01 10:38:51,011	INFO usage_lib.py:398 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-08-01 10:38:51,011	INFO scripts.py:710 -- Local node IP: 10.244.1.11
2024-08-01 10:38:53,427	SUCC scripts.py:747 -- --------------------
2024-08-01 10:38:53,427	SUCC scripts.py:748 -- Ray runtime started.
2024-08-01 10:38:53,427	SUCC scripts.py:749 -- --------------------
2024-08-01 10:38:53,427	INFO scripts.py:751 -- Next steps
2024-08-01 10:38:53,428	INFO scripts.py:754 -- To add another node to this Ray cluster, run
2024-08-01 10:38:53,428	INFO scripts.py:757 --   ray start --address='10.244.1.11:6379'
2024-08-01 10:38:53,428	INFO scripts.py:766 -- To connect to this Ray cluster:
2024-08-01 10:38:53,428	INFO scripts.py:768 -- import ray
2024-08-01 10:38:53,428	INFO scripts.py:769 -- ray.init()
2024-08-01 10:38:53,428	INFO scripts.py:781 -- To submit a Ray job using the Ray Jobs CLI:
2024-08-01 10:38:53,428	INFO scripts.py:782 --   RAY_ADDRESS='http://10.244.1.11:8265' ray job submit --working-dir . -- python my_script.py
2024-08-01 10:38:53,428	INFO scripts.py:791 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
2024-08-01 10:38:53,428	INFO scripts.py:795 -- for more information on submitting Ray jobs to the Ray cluster.
2024-08-01 10:38:53,428	INFO scripts.py:800 -- To terminate the Ray runtime, run
2024-08-01 10:38:53,428	INFO scripts.py:801 --   ray stop
2024-08-01 10:38:53,428	INFO scripts.py:804 -- To view the status of the cluster, use
2024-08-01 10:38:53,428	INFO scripts.py:805 --   ray status
2024-08-01 10:38:53,428	INFO scripts.py:809 -- To monitor and debug Ray, view the dashboard at 
2024-08-01 10:38:53,428	INFO scripts.py:810 --   10.244.1.11:8265
2024-08-01 10:38:53,428	INFO scripts.py:817 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2024-08-01 10:38:53,428	INFO utils.py:112 -- Overwriting previous Ray address (10.244.1.10:6379). Running ray.init() on this node will now connect to the new instance at 10.244.1.11:6379. To override this behavior, pass address=10.244.1.10:6379 to ray.init().
2024-08-01 10:38:53,428	INFO scripts.py:917 -- --block
2024-08-01 10:38:53,428	INFO scripts.py:918 -- This command will now block forever until terminated by a signal.
2024-08-01 10:38:53,428	INFO scripts.py:921 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

However, in the Ray head, when I ran python -c "import ray; ray.init()", the following errors were raised.

2024-08-01 11:01:08,956	INFO worker.py:1314 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2024-08-01 11:01:08,956	INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.244.1.11:6379...
2024-08-01 11:01:08,966	INFO worker.py:1616 -- Connected to Ray cluster. View the dashboard at 10.244.1.11:8265 
2024-08-01 11:01:08,980	WARNING worker.py:1964 -- The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib64/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x75886ad7c580>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/default/rayclusters/raycluster-kuberay (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x75886ad7c580>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/monitor.py", line 549, in run
    self._initialize_autoscaler()
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/monitor.py", line 233, in _initialize_autoscaler
    self.autoscaler = StandardAutoscaler(
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 251, in __init__
    self.reset(errors_fatal=True)
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1111, in reset
    raise e
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1028, in reset
    new_config = self.config_reader()
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
    ray_cr = self._fetch_ray_cr_from_k8s_with_retries()
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 71, in _fetch_ray_cr_from_k8s_with_retries
    return self._fetch_ray_cr_from_k8s()
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 85, in _fetch_ray_cr_from_k8s
    result = requests.get(
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/default/rayclusters/raycluster-kuberay (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x75886ad7c580>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Further testing with kubectl get svc kubernetes -o wide shows the kubernetes service is present with a ClusterIP of 10.96.0.1.

Attempting to reach the service via curl -k https://kubernetes.default:443 does not work (could not resolve host: kubernetes.default). However, reaching it via curl -k 10.96.0.1:443 does work; the API server responds that the client sent an HTTP request to an HTTPS server, so the endpoint itself is reachable.
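For completeness, this is roughly how I double-checked name resolution from inside the head pod (pod name as shown by kubectl get pods):

  # inspect which DNS server and search domains the pod is using
  kubectl exec -it raycluster-kuberay-head-<id> -- cat /etc/resolv.conf
  # try resolving the API server hostname the same way the autoscaler does
  kubectl exec -it raycluster-kuberay-head-<id> -- \
    python -c "import socket; print(socket.gethostbyname('kubernetes.default'))"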

Do you think the autoscaler / kubernetes host error could be the issue? I searched the Slack channel for similar previous issues, but they seem to relate to an older version of Ray (1.4), whereas I am on Ray 2.4 now.

Ok, this is a promising lead. This may indicate a misconfiguration of DNS in the cluster. Hostname kubernetes.default should resolve to the IP address of the API server, so the autoscaler can do its job. I recommend following these troubleshooting steps.
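Roughly, those checks boil down to the standard Kubernetes DNS debugging steps (labels and names below are the usual defaults and may differ on your distro):

  # is the cluster DNS (CoreDNS/kube-dns) running, and does its service have endpoints?
  kubectl get pods -n kube-system -l k8s-app=kube-dns
  kubectl get svc,endpoints kube-dns -n kube-system
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
  # can a throwaway pod resolve in-cluster names at all?
  kubectl run dns-check --rm -it --restart=Never --image=busybox:1.28 -- \
    nslookup kubernetes.default.svc.cluster.local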

However, this should not be blocking you, because your Ray head node is configured with 5 CPUs and should be able to run jobs on its own without the autoscaler. It is worth trying to disable the autoscaler completely and submit some simple jobs to see if there are any other issues. You don’t need the autoscaler in this configuration anyway, because your worker pool has a fixed size.
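If you want to try that quickly, disabling it should just be a values change plus a helm upgrade; the release and chart names below are my guess based on your pod names, so adjust them to your setup:

  # turn off the in-tree autoscaler sidecar and roll the cluster
  helm upgrade raycluster kuberay/ray-cluster \
    -f values.yaml \
    --set head.enableInTreeAutoscaling=false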

Debugging through those troubleshooting steps took more time than I expected, but it was the right direction.

There was an issue with resolving the kubernetes.default IP address, and the connection was refused each time. Fixing that got the autoscaler working. After discussing it with some of my colleagues, we decided to keep the autoscaler, since we want to be able to scale and apply these configurations to other, bigger clusters in the future without having to reconfigure the Ray head node each time.

Further inspection also revealed an issue with the base dependencies in the image file: using a .rhel (RHEL-based) file led to the issues above, while reverting back to an Ubuntu image allowed it to run.

Big thanks for your help and guidance @lobanov, it was very much appreciated. I will mark this as the solution. Thank you.
