When you say “this was the latest response returned,” what did you do to get that response? Was this with ray.init(address="ray://ray.com:443")? That does look promising, though: it looks like you’re hitting a gRPC server, just with an unexpected method.
If you want to avoid having to remove SSL/TLS, you can do something like ray.init(address="ray://ray.com:443", _credentials=grpc.ssl_channel_credentials()).
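For completeness, the whole call would look roughly like the sketch below; ray.com:443 stands in for your load balancer endpoint, and (as noted in the reply that follows) whether ray.init() accepts _credentials may depend on your Ray version, so treat this as something to verify against your install rather than a confirmed API.

import grpc
import ray

# Keep TLS terminated at the load balancer and have the Ray client speak TLS too.
# ssl_channel_credentials() uses the system CA bundle by default; pass
# root_certificates=... if the endpoint is signed by a private CA.
ray.init(
    address="ray://ray.com:443",
    _credentials=grpc.ssl_channel_credentials(),
)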
Also, just to avoid confusion: the health check’s /package.service/method can be left at the default for the client (in the workaround article from earlier, it looks like they just left it as /). If you’re using ClusterInfo, I’m not sure what error code it would return (the method would technically be implemented, but the health check might use unexpected arguments/gRPC context, which would cause it to fail).
That is correct! I was just trying to connect to ray.com:80, and that was the response that was returned. I don’t think the _credentials parameter exists in ray.init(); from what I’m seeing, it only exists in the deprecated client util.
I have also tried switching to a network load balancer service and I am getting the same error. I’m running the KubeRay 0.3.0 operator and Ray 2.0.0. What is the best way for me to debug this? I have also tried using Traefik’s h2c scheme and the suggested method with an nginx controller. It would be awesome if I could get this to work, and I would happily document the process if I do.
Yes and no: it is currently working with a network load balancer with no SSL. We made additions to the configuration below to enable the autoscaler and leverage Karpenter nodes. The key thing to note in the NLB definition is the selection of the Ray head node: “ray.io/identifier: raycluster-complete-head”. Depending on your use case, and since you already have the dashboard endpoint exposed, you may be able to use the Ray Job Python SDK (Ray 2.0.0) instead. Our use case seemed better suited to interactive development, though, which is why we needed to expose the client: the goal is a better local development experience for data scientists who need more compute or want to do distributed processing/training. We also use PrefectHQ/prefect-ray (Prefect integrations with Ray, github.com), which requires access to the client.
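In case it helps anyone reading along, a minimal sketch of the Ray Job SDK route against an exposed dashboard endpoint looks roughly like this (the URL and entrypoint script are placeholders, not part of our setup):

from ray.job_submission import JobSubmissionClient

# Point the client at the exposed dashboard endpoint (port 8265 by default).
client = JobSubmissionClient("http://dashboard.example.com:8265")

# Submit a script as a job; working_dir is uploaded to the cluster with the job.
job_id = client.submit_job(
    entrypoint="python my_script.py",
    runtime_env={"working_dir": "./"},
)
print(client.get_job_status(job_id))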
If you have any questions about the EKS setup itself, let me know.
ray-complete.yaml
# The resource requests and limits in this config are too small for production!
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.0.0'
  ######################headGroupSpec#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: NodePort
    # for the head group, replicas should always be 1.
    # headGroupSpec.replicas is deprecated in KubeRay >= 0.3.0.
    replicas: 1
    # the following params are used to complete the ray start: ray start --head --block --dashboard-host='0.0.0.0' ...
    rayStartParams:
      dashboard-host: '0.0.0.0'
      block: 'true'
    # pod template
    template:
      metadata:
        labels:
          # custom labels. NOTE: do not define custom labels that start with `raycluster.`; they may be used by the controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster-sample # will be injected if missing
          rayNodeType: head # will be injected if missing, must be head or worker
          groupName: headgroup # will be injected if missing
        # annotations for pod
        annotations:
          key: value
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.0.0-py39-cpu
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "500m"
              memory: "1G"
        volumes:
        - name: ray-logs
          emptyDir: {}
  workerGroupSpecs:
  # the number of pod replicas in this worker group
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    # logical group name; in this config it is called large-group, and it can also be functional
    groupName: large-group
    # if worker pods need to be added, we can simply increment the replicas
    # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
    # the operator will remove pods from the list until the number of replicas is satisfied
    # when a pod is confirmed to be deleted, its name will be removed from the list below
    #scaleStrategy:
    #  workersToDelete:
    #  - raycluster-complete-worker-large-group-bdtwh
    #  - raycluster-complete-worker-large-group-hv457
    #  - raycluster-complete-worker-large-group-k8tj7
    # the following params are used to complete the ray start: ray start --block
    rayStartParams:
      block: 'true'
    # pod template
    template:
      metadata:
        labels:
          rayCluster: raycluster-complete # will be injected if missing
          rayNodeType: worker # will be injected if missing
          groupName: large-group # will be injected if missing
      spec:
        containers:
        - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
          image: rayproject/ray:2.0.0-py39-cpu
          # environment variables to set in the container. Optional.
          # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # use volumeMounts. Optional.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "500m"
              memory: "256Mi"
        initContainers:
        # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
        - name: init-myservice
          image: busybox:1.28
          # Change the cluster postfix if you don't have a default setting
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
        - name: ray-logs
          emptyDir: {}