Hi everyone!
I am fairly new to Ray and Kubernetes and have spent more time than I'd like to admit trying to solve this issue. I have a feeling that it should be easily resolvable and that I'm just looking in the wrong place... hopefully somebody here has an answer or can point me in the right direction.
Situation
I am trying to start a RayCluster on K8s by following the detailed quickstart guide here: RayCluster Quickstart — Ray 2.7.1. After setting up everything, I tried to submit a job from within a dummy pod that I created from the following manifest:
kind: Pod
apiVersion: v1
metadata:
  name: dummy-pod
spec:
  containers:
    - name: dummy-pod
      image: python:3.9
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5 ; done"]
  restartPolicy: Never
After connecting to the dummy pod with

kubectl exec --stdin --tty dummy-pod -- /bin/bash

and installing ray[default], a job can be submitted successfully to the RayCluster with

ray job submit --address http://$HEADSERVICE:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

where $HEADSERVICE is the name of the service associated with the RayCluster head pod. Up until here, everything is fun and games.
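As a quick sanity check before running ray job submit, I sometimes verify from inside the dummy pod that the dashboard port is reachable at all. A minimal stand-alone sketch (plain sockets, no Ray dependency; the service name below is just a placeholder for the actual $HEADSERVICE):

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder service name -- substitute the real $HEADSERVICE here.
    print(can_connect("raycluster-kuberay-head-svc", 8265))
```

If this returns False, the problem is below the Ray layer (DNS, service selector, or nothing listening on the port) rather than in the job submission itself.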
Problem
I modified the helm chart that sets up the cluster (based on the original) so that the head node uses a different Docker image, which is required for running the final workload. There are a few other changes as well, resulting in the following manifest (generated with helm install --dry-run --debug chart/):
NAME: XXX
LAST DEPLOYED: Tue Oct 10 16:27:14 2023
NAMESPACE: default
STATUS: pending-install
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
raycluster:
  enabled: true
COMPUTED VALUES:
additionalWorkerGroups:
  smallGroup:
    affinity: {}
    annotations: {}
    args: []
    command: []
    containerEnv: []
    disabled: true
    envFrom: []
    maxReplicas: 3
    minReplicas: 0
    nodeSelector: {}
    rayStartParams: {}
    replicas: 0
    securityContext: {}
    serviceAccountName: ""
    sidecarContainers: []
    tolerations: []
    volumeMounts:
      - mountPath: /tmp/ray
        name: log-volume
    volumes:
      - emptyDir: {}
        name: log-volume
bucket:
  name: XXX
fullnameOverride: ""
head:
  affinity: {}
  annotations: {}
  containerEnv: []
  envFrom: []
  nodeSelector: {}
  rayStartParams:
    dashboard-host: 0.0.0.0
  resources:
    limits:
      cpu: "1"
      memory: 2G
    requests:
      cpu: "1"
      memory: 2G
  securityContext: {}
  serviceAccountName: ""
  sidecarContainers: []
  tolerations: []
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  volumes:
    - emptyDir: {}
      name: log-volume
imagePullSecrets: []
kubernetesClusterDomain: cluster.local
nameOverride: kuberay
project:
  name: XXX
raycluster:
  enabled: true
rayimage:
  pullPolicy: IfNotPresent
  repository: rayproject/ray
  tag: 2.7.0.cf4a87-py39
rayworker:
  affinity: {}
  annotations: {}
  args: []
  command: []
  containerEnv: []
  envFrom: []
  groupName: workergroup
  nodeSelector: {}
  rayStartParams: {}
  replicas: 1
  securityContext: {}
  serviceAccountName: ""
  sidecarContainers: []
  tolerations: []
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  volumes:
    - emptyDir: {}
      name: log-volume
XXXimage:
  repository: gcr.io/XXX
  tag: latest
redis:
  port: 6380
  server: redis-server
service:
  headService: {}
  type: ClusterIP
timeout: 3600
webport: 5000
worker:
  resources:
    limits:
      cpu: "4"
      memory: 4G
    requests:
      cpu: "4"
      memory: 4G
HOOKS:
MANIFEST:
---
# Source: chart/templates/web.yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  labels:
    apptype: web
spec:
  type: LoadBalancer
  selector:
    apptype: web
  ports:
    - name: web
      port: 5000
      protocol: TCP
---
# Source: chart/templates/web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    apptype: web
spec:
  replicas: 1
  selector:
    matchLabels:
      apptype: web
  template:
    metadata:
      labels:
        apptype: web
    spec:
      containers:
        - name: web
          image: XXX
          env:
            - name: REDIS_SERVER
              value: "redis-server"
            - name: REDIS_PORT
              value: "6380"
            - name: TIMEOUT
              value: "3600"
            - name: PROJECT_NAME
              value: XXX
            - name: BUCKET_NAME
              value: XXX
          command: ["gunicorn"]
          args: ["--bind", "0.0.0.0:5000", "--chdir", "XXX", "--timeout", "3600", "api:app"]
          ports:
            - containerPort: 5000
              name: web
              protocol: TCP
---
# Source: chart/templates/raycluster.yaml
# https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/templates/raycluster-cluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    app.kubernetes.io/name: kuberay
    helm.sh/chart: chart-0.1.0
    app.kubernetes.io/instance: XXX
    app.kubernetes.io/version: "0.1.0"
    app.kubernetes.io/managed-by: Helm
  name: XXX
spec:
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        imagePullSecrets: []
        containers:
          - volumeMounts:
              - mountPath: /tmp/ray
                name: log-volume
            name: ray-head
            image: XXX
            imagePullPolicy: Always
            resources:
              limits:
                cpu: "1"
                memory: 2G
              requests:
                cpu: "1"
                memory: 2G
            securityContext: {}
            env: []
        volumes:
          - emptyDir: {}
            name: log-volume
        affinity: {}
        tolerations: []
        nodeSelector: {}
      metadata:
        annotations: {}
        labels:
          apptype: rayhead
  workerGroupSpecs:
    - rayStartParams: {}
      replicas: 1
      minReplicas: 0
      maxReplicas: 2147483647
      groupName: workergroup
      template:
        spec:
          imagePullSecrets: []
          containers:
            - volumeMounts:
                - mountPath: /tmp/ray
                  name: log-volume
              name: ray-worker
              image: rayproject/ray:2.7.0.cf4a87-py39
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "4"
                  memory: 4G
                requests:
                  cpu: "4"
                  memory: 4G
              securityContext: {}
              env: []
              ports: null
          volumes:
            - emptyDir: {}
              name: log-volume
          affinity: {}
          tolerations: []
          nodeSelector: {}
        metadata:
          annotations: {}
          labels:
            apptype: rayworker
After installing this helm chart, connecting to the dummy-pod and repeating the ray job submit command from above, I get the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 395, in request
    self.endheaders()
  File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 243, in connect
    self.sock = self._new_conn()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 218, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7f35386880>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host=$HEADSERVICE, port=8265): Max retries exceeded with url: /api/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f35386880>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 262, in _check_connection_and_version_with_url
    r = self._do_request("GET", url)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 303, in _do_request
    return requests.request(
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host=$HEADSERVICE, port=8265): Max retries exceeded with url: /api/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f35386880>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 253, in submit
    client = _get_sdk_client(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 29, in _get_sdk_client
    client = JobSubmissionClient(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 110, in __init__
    self._check_connection_and_version(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 278, in _check_connection_and_version_with_url
    raise ConnectionError(
ConnectionError: Failed to connect to Ray at address: http://$HEADSERVICE:8265.
For some reason, it is not possible to reach $HEADSERVICE from the dummy pod after making these changes to the helm chart. Notably, the error is a connection refusal (Errno 111) rather than a DNS failure, so the service name apparently resolves but nothing accepts connections on port 8265. I cannot figure out where the problem might be and would be tremendously grateful for any hints.
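To narrow this down, it helped me to distinguish a DNS problem from a port that simply has no listener. A rough classification sketch (hypothetical helper, plain sockets, no Ray dependency) that I run from the dummy pod:

```python
import socket


def diagnose(host: str, port: int, timeout: float = 3.0) -> str:
    """Roughly classify why http://host:port might be unreachable."""
    try:
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns-failure"      # the service name does not resolve at all
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"           # something is listening on the port
    except ConnectionRefusedError:
        return "refused"          # name resolves, but nothing listens (Errno 111)
    except OSError:
        return "unreachable"      # timeout, dropped packets, network policy, ...
```

In my case the traceback corresponds to the "refused" branch, which would mean the head service exists and resolves, but the dashboard in the new head image is not actually listening on 8265.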