`socket.gaierror` when connecting from within a Kubernetes cluster

I’m trying to connect to my Ray cluster from another pod in the same Kubernetes cluster as described here.

When my script tries to connect to the Ray head service with ray.init("ray://<cluster-name>-ray-head:10001"), I get the following traceback:

Traceback (most recent call last):
  File "/home/sabri/code/domino/scratch/sabri/09-01_train_slices_gqa.py", line 6, in <module>
    ray.init("ray://ray-t4-1-cluster-ray-head:10001")
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/worker.py", line 718, in init
    redis_address, _, _ = services.validate_redis_address(address)
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/_private/services.py", line 362, in validate_redis_address
    redis_address = address_to_ip(address)
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/_private/services.py", line 394, in address_to_ip
    ip_address = socket.gethostbyname(address_parts[0])
socket.gaierror: [Errno -2] Name or service not known

The job manifest I’m using looks like:

# Job to submit a Ray program from a pod outside a running Ray cluster.
apiVersion: batch/v1
kind: Job
metadata:
  name: ray-test-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ray
          image: rayproject/ray:latest-py38
          imagePullPolicy: Always
          command: [ "/bin/bash", "-c", "--" ]
          args:
            - "source /pd/sabri/ray-startup.sh"
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
          volumeMounts:
            - name: pv-1 # replace this with the name of the persistent volume you want to mount
              mountPath: /pd # this will mount the volume pv-1 at /pd
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: pv-1 # replace this with the name of the persistent volume you want to mount 
          persistentVolumeClaim:
            claimName: pvc-1 # replace this with the name of the persistent volume claim 
        - name: dshm
          emptyDir:
            medium: Memory

What might be causing this error when trying to connect?

I was running Ray 1.4; upgrading to 1.6.0 solved this issue for me.

Thanks to this note in the Ray documentation:

If you encounter socket.gaierror: [Errno -2] Name or service not known when using ray.init("ray://…") then you may be on a version of Ray prior to 1.5 that does not support starting client connections through ray.init. If this is the case, see the 1.4.1 docs for Ray client.
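
If upgrading is not an option right away, a version guard along these lines might help. This is only a sketch: supports_ray_scheme and connect_to_head are hypothetical helper names I made up, the 1.5 cutoff is taken from the note above, and the fallback assumes the older ray.util.connect client entry point that the 1.4.x docs describe.

```python
def supports_ray_scheme(version: str) -> bool:
    """Return True if this Ray version should accept ray.init("ray://...").

    Hypothetical helper: the (1, 5) cutoff comes from the docs note above.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 5)


def connect_to_head(host: str, port: int = 10001):
    """Connect to a Ray head service using whichever client API is available."""
    import ray  # imported lazily so the version check above works without Ray

    if supports_ray_scheme(ray.__version__):
        # Ray 1.5+ understands the ray:// scheme directly in ray.init.
        return ray.init(f"ray://{host}:{port}")
    # Assumption: older releases (e.g. 1.4.x) use the separate client call,
    # which takes "host:port" without the ray:// prefix.
    return ray.util.connect(f"{host}:{port}")
```

In my case the call would be connect_to_head("ray-t4-1-cluster-ray-head"). Note that the client and the cluster generally need to run matching Ray versions, so upgrading both is still the cleaner fix.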