Unable to send job to RayCluster from within K8s pod

Hi everyone!

I am fairly new to Ray and Kubernetes and have spent more time than I’d like to admit trying to solve this issue. I have a feeling that it should be easy to resolve and that I’m just looking in the wrong place…hopefully somebody here has an answer or can point me in the right direction :smiley:

Situation
I am trying to start a RayCluster on K8s by following the detailed quickstart guide here: RayCluster Quickstart — Ray 2.7.1. After setting everything up, I tried to submit a job from within a dummy pod, which I created from the following manifest:

kind: Pod
apiVersion: v1
metadata:
  name: dummy-pod
spec:
  containers:
    - name: dummy-pod
      image: python:3.9
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5 ; done"]
  restartPolicy: Never
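
I create the pod with a plain kubectl apply (assuming the manifest is saved as dummy-pod.yaml):

kubectl apply -f dummy-pod.yaml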

After connecting to the dummy pod with

kubectl exec --stdin --tty dummy-pod -- /bin/bash

and installing ray[default] (exact pip command below), a job can be submitted to the RayCluster with

ray job submit --address http://$HEADSERVICE:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

where $HEADSERVICE is the name of the service associated with the RayCluster head pod. Up to this point, everything is fun and games :wink:
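
For completeness, the setup inside the dummy pod looks roughly like this. Two assumptions on my side: I pin the client to the cluster’s Ray version (2.7.0 here, matching the worker image tag), since ray job submit checks client and server versions via the /api/version endpoint, and the service name follows KubeRay’s default <raycluster-name>-head-svc scheme:

pip install "ray[default]==2.7.0"
export HEADSERVICE=raycluster-kuberay-head-svc  # <raycluster-name>-head-svc; adjust to your cluster name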

Problem
I modified the Helm chart that sets up the cluster: compared to the original, the head node now uses a different Docker image, which is required for running the final workload. There are a few other changes as well, resulting in the following manifest (generated with helm install --dry-run --debug chart/):

NAME: XXX
LAST DEPLOYED: Tue Oct 10 16:27:14 2023
NAMESPACE: default
STATUS: pending-install
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
raycluster:
  enabled: true

COMPUTED VALUES:
additionalWorkerGroups:
  smallGroup:
    affinity: {}
    annotations: {}
    args: []
    command: []
    containerEnv: []
    disabled: true
    envFrom: []
    maxReplicas: 3
    minReplicas: 0
    nodeSelector: {}
    rayStartParams: {}
    replicas: 0
    securityContext: {}
    serviceAccountName: ""
    sidecarContainers: []
    tolerations: []
    volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
    volumes:
    - emptyDir: {}
      name: log-volume
bucket:
  name: XXX
fullnameOverride: ""
head:
  affinity: {}
  annotations: {}
  containerEnv: []
  envFrom: []
  nodeSelector: {}
  rayStartParams:
    dashboard-host: 0.0.0.0
  resources:
    limits:
      cpu: "1"
      memory: 2G
    requests:
      cpu: "1"
      memory: 2G
  securityContext: {}
  serviceAccountName: ""
  sidecarContainers: []
  tolerations: []
  volumeMounts:
  - mountPath: /tmp/ray
    name: log-volume
  volumes:
  - emptyDir: {}
    name: log-volume
imagePullSecrets: []
kubernetesClusterDomain: cluster.local
nameOverride: kuberay
project:
  name: XXX
raycluster:
  enabled: true
rayimage:
  pullPolicy: IfNotPresent
  repository: rayproject/ray
  tag: 2.7.0.cf4a87-py39
rayworker:
  affinity: {}
  annotations: {}
  args: []
  command: []
  containerEnv: []
  envFrom: []
  groupName: workergroup
  nodeSelector: {}
  rayStartParams: {}
  replicas: 1
  securityContext: {}
  serviceAccountName: ""
  sidecarContainers: []
  tolerations: []
  volumeMounts:
  - mountPath: /tmp/ray
    name: log-volume
  volumes:
  - emptyDir: {}
    name: log-volume
XXXimage:
  repository: gcr.io/XXX
  tag: latest
redis:
  port: 6380
  server: redis-server
service:
  headService: {}
  type: ClusterIP
timeout: 3600
webport: 5000
worker:
  resources:
    limits:
      cpu: "4"
      memory: 4G
    requests:
      cpu: "4"
      memory: 4G

HOOKS:
MANIFEST:
---
# Source: chart/templates/web.yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  labels:
    apptype: web
spec:
  type: LoadBalancer
  selector:
    apptype: web
  ports:
  - name: web
    port: 5000
    protocol: TCP
---
# Source: chart/templates/web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    apptype: web
spec:
  replicas: 1
  selector:
    matchLabels:
      apptype: web
  template:
    metadata:
      labels:
        apptype: web
    spec:
      containers:
      - name: web
        image: XXX
        env:
        - name: REDIS_SERVER
          value: "redis-server"
        - name: REDIS_PORT
          value: "6380"
        - name: TIMEOUT
          value: "3600"
        - name: PROJECT_NAME
          value: XXX
        - name: BUCKET_NAME
          value: XXX
        command: ["gunicorn"]
        args: ["--bind", "0.0.0.0:5000", "--chdir", "XXX", "--timeout", "3600", "api:app"]
        ports:
        - containerPort: 5000
          name: web
          protocol: TCP
---
# Source: chart/templates/raycluster.yaml
# https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/templates/raycluster-cluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    app.kubernetes.io/name: kuberay
    helm.sh/chart: chart-0.1.0
    app.kubernetes.io/instance: XXX
    app.kubernetes.io/version: "0.1.0"
    app.kubernetes.io/managed-by: Helm
  name: XXX
  
spec:
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
        dashboard-host: "0.0.0.0"
    template:
      spec:
        imagePullSecrets:
          []
        containers:
          - volumeMounts:
            - mountPath: /tmp/ray
              name: log-volume
            name: ray-head
            image: XXX
            imagePullPolicy: Always
            resources:
              limits:
                cpu: "1"
                memory: 2G
              requests:
                cpu: "1"
                memory: 2G
            securityContext:
              {}
            env:
              []
        volumes:
          - emptyDir: {}
            name: log-volume
        affinity:
          {}
        tolerations:
          []
        nodeSelector:
          {}
      metadata:
        annotations:
          {}
        labels:
          apptype: rayhead

  workerGroupSpecs:
  - rayStartParams:
      {}
    replicas: 1
    minReplicas: 0
    maxReplicas: 2147483647
    groupName: workergroup
    template:
      spec:
        imagePullSecrets:
          []
        containers:
          - volumeMounts:
            - mountPath: /tmp/ray
              name: log-volume
            name: ray-worker
            image: rayproject/ray:2.7.0.cf4a87-py39
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "4"
                memory: 4G
              requests:
                cpu: "4"
                memory: 4G
            securityContext:
              {}
            env:
              []
            ports:
              null
        volumes:
          - emptyDir: {}
            name: log-volume
        affinity:
          {}
        tolerations:
          []
        nodeSelector:
          {}
      metadata:
        annotations:
          {}
        labels:
          apptype: rayworker

After installing this Helm chart (the main difference to the quickstart being the custom image: XXX on the head container, pulled with imagePullPolicy: Always), connecting to the dummy pod again, and repeating the ray job submit command from above, I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 395, in request
    self.endheaders()
  File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 243, in connect
    self.sock = self._new_conn()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 218, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7f35386880>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host=$HEADSERVICE, port=8265): Max retries exceeded with url: /api/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f35386880>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 262, in _check_connection_and_version_with_url
    r = self._do_request("GET", url)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 303, in _do_request
    return requests.request(
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host=$HEADSERVICE, port=8265): Max retries exceeded with url: /api/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f35386880>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 253, in submit
    client = _get_sdk_client(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 29, in _get_sdk_client
    client = JobSubmissionClient(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 110, in __init__
    self._check_connection_and_version(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 278, in _check_connection_and_version_with_url
    raise ConnectionError(
ConnectionError: Failed to connect to Ray at address: http://$HEADSERVICE:8265.

For some reason, the $HEADSERVICE cannot be reached from the dummy pod after these changes to the Helm chart. I cannot figure out where the problem might be and would be tremendously grateful for any hints.
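
In case it helps, these are the checks I plan to run next. The apptype=rayhead label comes from my template above, and /api/version is the endpoint that ray job submit probes first:

# is the RayCluster reconciled, and is the head pod actually Running (not e.g. ImagePullBackOff)?
kubectl get rayclusters
kubectl get pods -l apptype=rayhead
kubectl describe pods -l apptype=rayhead

# does a head service exist, and does it expose port 8265?
kubectl get svc

# from inside the dummy pod: does the dashboard answer at all?
curl -sv http://$HEADSERVICE:8265/api/version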