I’m trying to connect to my Ray cluster from another pod in the same Kubernetes cluster, as described here.
When my script calls ray.init("ray://<cluster-name>-ray-head:10001") to connect to the Ray head service,
I get the following error:
Traceback (most recent call last):
  File "/home/sabri/code/domino/scratch/sabri/09-01_train_slices_gqa.py", line 6, in <module>
    ray.init("ray://ray-t4-1-cluster-ray-head:10001")
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/worker.py", line 718, in init
    redis_address, _, _ = services.validate_redis_address(address)
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/_private/services.py", line 362, in validate_redis_address
    redis_address = address_to_ip(address)
  File "/home/common/envs/conda/envs/domino/lib/python3.8/site-packages/ray/_private/services.py", line 394, in address_to_ip
    ip_address = socket.gethostbyname(address_parts[0])
socket.gaierror: [Errno -2] Name or service not known
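To narrow this down, here is a minimal sketch of a check that could be run from the same environment as the failing script. It compares two readings of the address: the hostname a URL parser extracts, and the first piece obtained by splitting on ":". The second reading is an assumption about how `address_parts` in the last traceback frame is built; it is not confirmed from the Ray source. The helper names `naive_host`, `client_host`, and `resolves` are hypothetical and not part of Ray.

```python
import socket
from urllib.parse import urlparse

# Address taken from the traceback above.
ADDRESS = "ray://ray-t4-1-cluster-ray-head:10001"


def naive_host(address: str) -> str:
    """First piece of the address when split on ':'.

    Assumption: this mirrors address_parts[0] in the traceback.
    """
    return address.split(":")[0]


def client_host(address: str) -> str:
    """Hostname as a URL parser sees it, ignoring scheme and port."""
    return urlparse(address).hostname


def resolves(host: str) -> bool:
    """Return True if DNS (or /etc/hosts) can resolve the given name."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False
```

Calling `resolves(naive_host(ADDRESS))` and `resolves(client_host(ADDRESS))` shows which name fails to resolve. If the service name resolves but the call still fails, the lookup may be happening on the `ray` prefix rather than on the service name, which would point at how the installed Ray version handles `ray://` addresses rather than at cluster DNS.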
The job manifest I’m using looks like:
# Job to submit a Ray program from a pod outside a running Ray cluster.
apiVersion: batch/v1
kind: Job
metadata:
  name: ray-test-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ray
          image: rayproject/ray:latest-py38
          imagePullPolicy: Always
          command: [ "/bin/bash", "-c", "--" ]
          args:
            - "source /pd/sabri/ray-startup.sh"
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
          volumeMounts:
            - name: pv-1 # replace this with the name of the persistent volume you want to mount
              mountPath: /pd # this will mount the volume pv-1 at /pd
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: pv-1 # replace this with the name of the persistent volume you want to mount
          persistentVolumeClaim:
            claimName: pvc-1 # replace this with the name of the persistent volume claim
        - name: dshm
          emptyDir:
            medium: Memory
What might be causing this error when trying to connect?