Connecting to remote Ray cluster on K8s

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Problem:
Disclaimer: I literally just discovered Ray, so total newbie here. I'm trying to see if I can use a Ray Actor as a cache that my ML pipeline can access (I'd prefer to use Ray rather than a separate, traditional key-value store, for simplicity). I deployed a Ray cluster on Kubernetes via the operator (default settings), and the dashboard launches fine on port 8265. I confirmed that I can access it in a web browser at external-address-of-dashboard:8265 (I'm using an Ingress Controller to expose the service).
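For context, this is roughly the kind of thing I'm hoping to do; a minimal sketch, with made-up actor and key names:

import ray

ray.init()  # connecting to the remote cluster is exactly the part I'm stuck on (see below)

@ray.remote
class Cache:
    # A tiny key-value store held by a single actor.
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)

# Named, detached actor so other jobs in the pipeline can look it up by name later.
cache = Cache.options(name="pipeline-cache", lifetime="detached", get_if_exists=True).remote()
ray.get(cache.put.remote("features:v1", [1, 2, 3]))
print(ray.get(cache.get.remote("features:v1")))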

Now I'm trying to do ray.init(address="external-address-of-dashboard:8265") from my local workstation and I get

2022-09-06 07:14:37,429 WARNING utils.py:1333 -- Unable to connect to GCS at my-address:8265. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

The Ray versions are indeed the same (2.0):

from the head pod:
kubectl exec -it raycluster-autoscaler-head-fwvnn -nray -- ray --version
Defaulted container “ray-head” out of: ray-head, autoscaler
ray, version 2.0.0

from my workstation:
pip show ray
Name: ray
Version: 2.0.0
Location: …/.pyenv/versions/3.10.3/lib/python3.10/site-packages

It looks like you might want to use Ray Client for this: Ray Client: Interactive Development — Ray 2.0.0

i.e. You’ll need to expose port 10001 of your head node, and then run ray.init(address="ray://external-address:10001").

If you want to test it out without setting up an ingress, you can also run kubectl port-forward pod/<head-node-pod> 10001:10001 and then do ray.init(address="ray://localhost:10001")
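Once the port is reachable, a quick way to confirm the client is actually talking to the remote cluster is something like this (the address is a placeholder for however you expose port 10001):

import ray

# Connect through Ray Client; swap in your ingress/LoadBalancer address,
# or keep localhost if you're using the port-forward above.
ray.init(address="ray://localhost:10001")

# Sanity check: this should report the cluster's resources, not your workstation's.
print(ray.cluster_resources())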

Ah - that narrowed it down some:
2022-09-06 16:02:17,613 INFO client_builder.py:247 -- Passing the following kwargs to ray.init() on the server: dashboard_host
Traceback (most recent call last):

RuntimeError: Python minor versions differ between client and server: client is 3.10.3, server is 3.7.7

(I thought the original error was referring to Ray versions…)

I suppose I need to get a different version of the Ray K8s cluster, then? I deployed using the manifest provided here: Getting Started with KubeRay — Ray 3.0.0.dev0

Yeah, you can either change your local Python version to 3.7 or use a different image. It looks like the Getting Started config links to this: https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.autoscaler.yaml

That config has image: rayproject/ray:2.0.0; you can switch it to image: rayproject/ray:2.0.0-py310

Updating the image appears to have worked for the head node, but not the worker node - it's stuck on its init container and never becomes Ready:

Every 2.0s: kubectl get pods --selector=ray.io/cluster=raycluster-autoscaler -nray  49er.intranet.hyperic.net: Tue Sep  6 16:28:11 2022

NAME                                             READY   STATUS     RESTARTS   AGE
raycluster-autoscaler-head-2kv4n                 2/2     Running    0          89s
raycluster-autoscaler-worker-small-group-s9qxn   0/1     Init:0/1   0          89s

I'm not sure what broke; the worker node remains in this state even after I rolled back the image to rayproject/ray:2.0.0

Ah hmm, did you update the image under both headGroupSpec and workerGroupSpecs?

Yep, here’s my manifest:


# This config demonstrates KubeRay's Ray autoscaler integration.
# The resource requests and limits in this config are too small for production!
# For an example with more realistic resource configuration, see
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-autoscaler
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  rayVersion: '2.0.0'
  # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
  # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
  # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
  enableInTreeAutoscaling: true
  # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
  # The example configuration shown below represents the DEFAULT values.
  # (You may delete autoscalerOptions if the defaults are suitable.)
  autoscalerOptions:
    # upscalingMode is "Default" or "Aggressive."
    # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
    # Default: Upscaling is not rate-limited.
    # Aggressive: An alias for Default; upscaling is not rate-limited.
    upscalingMode: Default
    # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
    idleTimeoutSeconds: 60
    # image optionally overrides the autoscaler's container image.
    # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
    # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
    ## image: "my-repo/my-custom-autoscaler-image:tag"
    # imagePullPolicy optionally overrides the autoscaler container's image pull policy.
    imagePullPolicy: Always
    # resources specifies optional resource request and limit overrides for the autoscaler container.
    # For large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  ######################headGroupSpec#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: ClusterIP
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block ...
    rayStartParams:
      # Flag "no-monitor" will be automatically set when autoscaling is enabled.
      dashboard-host: '0.0.0.0'
      block: 'true'
      # num-cpus: '1' # can be auto-completed from the limits
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the specific format demonstrated below:
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    #pod template
    template:
      spec:
        containers:
          # The Ray head pod
          - name: ray-head
            image: rayproject/ray:2.0.0-py310
            imagePullPolicy: Always
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh","-c","ray stop"]
            resources:
              limits:
                cpu: "1"
                memory: "2G"
              requests:
                cpu: "500m"
                memory: "2G"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 300
      # logical group name, for this called small-group, also can be functional
      groupName: small-group
      # if worker pods need to be added, we can simply increment the replicas
      # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
      # the operator will remove pods from the list until the number of replicas is satisfied
      # when a pod is confirmed to be deleted, its name will be removed from the list below
      #scaleStrategy:
      #  workersToDelete:
      #  - raycluster-complete-worker-small-group-bdtwh
      #  - raycluster-complete-worker-small-group-hv457
      #  - raycluster-complete-worker-small-group-k8tj7
      # the following params are used to complete the ray start: ray start --block ...
      rayStartParams:
        block: 'true'
      #pod template
      template:
        metadata:
          labels:
            key: value
          # annotations for pod
          annotations:
            key: value
        spec:
          initContainers:
            # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
            - name: init-myservice
              image: busybox:1.28
              command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
          containers:
            - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc'
              image: rayproject/ray:2.0.0-py310
              # Environment variables to set in the container. Optional.
              # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              resources:
                limits:
                  cpu: "1"
                  memory: "1G"
                requests:
                  cpu: "500m"
                  memory: "1G"

So I dumped the logs of the init container that wouldn't complete, and it turns out the raycluster-autoscaler-head-svc.ray.svc.cluster.local service had somehow gotten deleted:

nslookup: can't resolve 'raycluster-autoscaler-head-svc.ray.svc.cluster.local'
waiting for myservice
Server: 100.64.0.10
Address 1: 100.64.0.10 kube-dns.kube-system.svc.cluster.local

Deleted the entire setup and started from scratch; now it works fine. Thanks for your help!