Unable to connect to Ray Cluster

When you say “this was the latest response returned,” what did you do to get that response? Was this with ray.init(address="ray://ray.com:443")? That does look promising, though: it looks like you’re hitting a gRPC server, just with an unexpected method.

If you want to avoid having to remove SSL/TLS, you can do something like ray.init(address="ray://ray.com:443", _credentials=grpc.ssl_channel_credentials())
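
Roughly, the full call would look like the sketch below (ray.com is a placeholder hostname, and this assumes the _credentials pass-through is accepted by your Ray version):

import grpc
import ray

# Keep TLS termination at the load balancer and hand the Ray client
# explicit TLS channel credentials instead of disabling SSL.
ray.init(
    address="ray://ray.com:443",
    _credentials=grpc.ssl_channel_credentials(),
)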

Also, just to avoid confusion, the health check’s /package.service/method can be left at the default for the client (in the workaround article from earlier, it looks like they just left it as /). If you’re using ClusterInfo, I’m not sure what error code it would return (it would technically be implemented, but the health check might use unexpected arguments/gRPC context, which would cause it to fail).

That is correct! I was just trying to connect to ray.com:80, and that was the response that was returned. I don’t think the _credentials parameter exists in ray.init(); from what I’m seeing, it only exists in the deprecated client util.

I have also tried switching to a network load balancer service and I am getting the same error. I’m running the KubeRay 0.3.0 operator and Ray 2.0.0. What is the best way for me to debug this? I have also tried using Traefik’s h2c scheme and the suggested method with an NGINX ingress controller. It would be awesome if I could get this to work, and I would happily document the process if I do.
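
For anyone debugging the same thing, a bare gRPC channel check along these lines can at least confirm whether traffic reaches the client port before Ray’s own handshake is involved (the endpoint below is a placeholder):

import grpc

# Placeholder endpoint for the load balancer fronting the Ray client port (10001).
channel = grpc.insecure_channel("ray.example.com:10001")

try:
    # Blocks until the channel is ready; raises grpc.FutureTimeoutError otherwise.
    grpc.channel_ready_future(channel).result(timeout=10)
    print("gRPC channel is ready")
except grpc.FutureTimeoutError:
    print("could not establish a gRPC connection")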

Did you manage to get it to work?

Yes and no. It is currently working with a network load balancer with no SSL. We added to the configuration below to enable the autoscaler and leverage Karpenter nodes. The key thing to note in the NLB definition is the selector for the Ray head node: “ray.io/identifier: raycluster-complete-head”.

Depending on your use case, and since you already have the dashboard endpoint exposed, you may be able to use Ray Job - Python SDK — Ray 2.0.0 instead (a rough sketch of that route is below). Our use case seemed better suited for interactive development, though, which is why we needed to expose the client: we wanted a better local development experience for data scientists who need more compute or want to do distributed processing/training. We also use PrefectHQ/prefect-ray: Prefect integrations with Ray (github.com), which requires access to the client.
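
To illustrate the Job SDK route mentioned above, a minimal sketch against the exposed dashboard port (the hostname and entrypoint script are placeholders, not our actual setup):

from ray.job_submission import JobSubmissionClient

# Point the job client at the exposed dashboard endpoint (placeholder hostname).
client = JobSubmissionClient("http://ray-dashboard.example.com:8265")

# Ship the local working directory and run a simple entrypoint on the cluster.
job_id = client.submit_job(
    entrypoint="python my_script.py",
    runtime_env={"working_dir": "./"},
)
print(client.get_job_status(job_id))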

If you have any questions around the EKS setup itself let me know.

ray-complete.yaml

# The resource requests and limits in this config are too small for production!
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.0.0'
  ######################headGroupSpec#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: NodePort
    # for the head group, replicas should always be 1.
    # headGroupSpec.replicas is deprecated in KubeRay >= 0.3.0.
    replicas: 1
    # the following params are used to complete the ray start command: ray start --head --block --dashboard-host='0.0.0.0' ...
    rayStartParams:
      dashboard-host: '0.0.0.0'
      block: 'true'
    #pod template
    template:
      metadata:
        labels:
          # custom labels. NOTE: do not define custom labels that start with `raycluster.`, as they may be used by the controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster-complete # will be injected if missing
          rayNodeType: head # will be injected if missing, must be head or worker
          groupName: headgroup # will be injected if missing
        # annotations for pod
        annotations:
          key: value
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.0.0-py39-cpu
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "500m"
              memory: "1G"
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    # logical group name; for this example it is called large-group, but it can also be functional
    groupName: large-group
    # if worker pods need to be added, we can simply increment the replicas
    # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
    # the operator will remove pods from the list until the number of replicas is satisfied
    # when a pod is confirmed to be deleted, its name will be removed from the list below
    #scaleStrategy:
    #  workersToDelete:
    #  - raycluster-complete-worker-large-group-bdtwh
    #  - raycluster-complete-worker-large-group-hv457
    #  - raycluster-complete-worker-large-group-k8tj7
    # the following params are used to complete the ray start: ray start --block
    rayStartParams:
      block: 'true'
    #pod template
    template:
      metadata:
        labels:
          rayCluster: raycluster-complete # will be injected if missing
          rayNodeType: worker # will be injected if missing
          groupName: large-group # will be injected if missing
      spec:
        containers:
        - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
          image: rayproject/ray:2.0.0-py39-cpu
          # environment variables to set in the container. Optional.
          # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # use volumeMounts. Optional.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
            requests:
              cpu: "500m"
              memory: "256Mi"
        initContainers:
        # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
        - name: init-myservice
          image: busybox:1.28
          # Change the cluster postfix if you don't have a default setting
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}

ray-nlb.yaml


apiVersion: v1
kind: Service
metadata:
  name: ray-lb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-name: ray-cluster-nlb
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
spec:
  type: LoadBalancer
  ports:
  - name: ray-api
    port: 10001
    protocol: TCP
    targetPort: 10001
  - name: ray-dashboard
    port: 8265
    protocol: TCP
    targetPort: 8265
  selector:
    ray.io/cluster: raycluster-complete
    ray.io/identifier: raycluster-complete-head
    ray.io/node-type: head
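
Once the NLB has been provisioned (kubectl get svc ray-lb shows its DNS name under EXTERNAL-IP), a minimal connectivity check from a local machine looks roughly like this (the hostname below is a placeholder):

import ray

# Placeholder NLB DNS name; substitute the EXTERNAL-IP reported for the ray-lb service.
ray.init(address="ray://ray-cluster-nlb-xxxxxxxx.elb.us-east-1.amazonaws.com:10001")

# If the client handshake on port 10001 succeeds, this prints the cluster's resources.
print(ray.cluster_resources())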