[Cluster] [K8] Is the client server automatically started in Ray 1.2.0 when running on K8s?

Hi,

I deployed Ray on a K8s cluster using the ray-ml image (plus a manual install of xgboost_ray). The pods are running v1.2.0, and I used the example cluster config provided in the GitHub repo.

The pods start normally (1 head and 2 workers), but when I port-forward the head pod's ports 10001 and 8265, nothing is listening on the pod side, i.e. all attempts to connect with ray.util.connect("127.0.0.1:50051") time out (as does the attempt to reach the dashboard).
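
For reference, this is roughly what I'm trying to run from my laptop against the forwarded port (the trivial remote task here is just a placeholder to check that anything gets scheduled at all):

import ray
import ray.util

# Connect via Ray Client through the forwarded port; this call times out.
ray.util.connect("127.0.0.1:50051")

@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))  # never reached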

When I manually start the client server (by logging into the head pod and running python -m ray.util.client.server), I can connect and run some basic code, but this workaround isn't very useful: I don't get access to the dashboard (it ends up bound to 127.0.0.1:8265 instead of 0.0.0.0:8265 when started via the ray.util.client.server script), and after a few minutes the connection to the pod times out and dies.

Looks like the following section from the example cluster deployment .yaml file is not having any effect:

  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

What would be the right way to fix this? Am I supposed to use Ray 2.0.0.dev0 instead of v1.2.0 to get a smoother experience on K8s?

Here is the .yaml file used to deploy the cluster:

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 0
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-head-
        labels:
          component: example-cluster-ray-head
      spec:
        restartPolicy: Never

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which will cause slowdowns if it is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: ecorro/ray-ml
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379  # Redis port
          - containerPort: 10001  # Used by Ray Client
          - containerPort: 8265  # Used by Ray Dashboard

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which will cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 6Gi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 12Gi
  - name: worker-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 2
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 3
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    rayResources: {"foo": 1, "bar": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-worker-
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: ecorro/ray-ml
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which will cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 6Gi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 8Gi
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

The automatic Ray Client server start was introduced after the Ray 1.2.0 release, so it isn't available in the version you're running.

To start the server on port 10001 with Ray 1.2.0, you can append --ray-client-server-port 10001 to the head's ray start command (i.e. in headStartRayCommands):

ray start --head --no-monitor --dashboard-host 0.0.0.0 --ray-client-server-port 10001
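
After updating headStartRayCommands with that flag and restarting the cluster, forwarding port 10001 from the head pod and connecting should work. Here's a minimal check, assuming the port is forwarded to localhost (the pod name placeholder and the trivial remote function are just for illustration):

# kubectl port-forward <head-pod-name> 10001:10001
import ray
import ray.util

ray.util.connect("127.0.0.1:10001")  # Ray Client connection via the forwarded port

@ray.remote
def hello():
    return "connected"

print(ray.get(hello.remote()))  # should print "connected"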
