Min_workers doesn't seem to be honored

When I just copy this file (ray/example-full.yaml at master · ray-project/ray · GitHub) and change the value of min_workers (ray/example-full.yaml at 6af02913470fa9f394e0fe37b8a7115e10a29a2b · ray-project/ray · GitHub) to something >0, Ray doesn't seem to provision the corresponding number of workers and leave them running. In some cases no worker is available at all after the startup procedure.
When watching what happens during startup, I can see that for a very short period of time a larger number of pods gets provisioned, but they are deprovisioned again just a few seconds later. Below I included the logs. I tested this on Kubernetes 1.18.16 and Ray 1.2.

Has anyone else seen this before?

```
2021-02-24 02:36:04,189	INFO autoscaler.py:203 -- StandardAutoscaler: example-cluster-ray-worker-9rxwz: Terminating outdated node.
2021-02-24 02:36:04,252	INFO autoscaler.py:203 -- StandardAutoscaler: example-cluster-ray-worker-ggx4d: Terminating outdated node.
2021-02-24 02:36:04,316	INFO autoscaler.py:203 -- StandardAutoscaler: example-cluster-ray-worker-n6v72: Terminating outdated node.
2021-02-24 02:36:04,376	INFO autoscaler.py:203 -- StandardAutoscaler: example-cluster-ray-worker-q5czz: Terminating outdated node.
2021-02-24 02:36:04,440	INFO autoscaler.py:203 -- StandardAutoscaler: example-cluster-ray-worker-wjrjj: Terminating outdated node.
2021-02-24 02:36:04,502	INFO autoscaler.py:203 -- StandardAutoscaler: example-cluster-ray-worker-wm7wf: Terminating outdated node.
2021-02-24 02:36:04,541	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,566	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,592	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,614	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,635	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,657	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,680	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,743	INFO autoscaler.py:221 -- StandardAutoscaler: example-cluster-ray-worker-xxsgg: Terminating unneeded node.
2021-02-24 02:36:04,757	INFO autoscaler.py:221 -- StandardAutoscaler: example-cluster-ray-worker-wm7wf: Terminating unneeded node.
2021-02-24 02:36:04,770	INFO autoscaler.py:221 -- StandardAutoscaler: example-cluster-ray-worker-wjrjj: Terminating unneeded node.
2021-02-24 02:36:04,782	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod
2021-02-24 02:36:04,806	INFO node_provider.py:148 -- KubernetesNodeProvider: calling delete_namespaced_pod


kubectl -n ray get pods
NAME                               READY   STATUS        RESTARTS   AGE
example-cluster-ray-head-vs46h     1/1     Running       0          32s
example-cluster-ray-worker-24wzc   1/1     Terminating   0          7s
example-cluster-ray-worker-2rx2w   1/1     Terminating   0          7s
example-cluster-ray-worker-5jgvg   0/1     Terminating   0          7s
example-cluster-ray-worker-5rm7l   1/1     Terminating   0          7s
example-cluster-ray-worker-8l5dp   0/1     Terminating   0          7s
example-cluster-ray-worker-bgsgn   1/1     Terminating   0          7s
example-cluster-ray-worker-dmhrd   1/1     Terminating   0          7s
example-cluster-ray-worker-hbr6j   1/1     Terminating   0          7s
example-cluster-ray-worker-tcwz7   1/1     Terminating   0          7s
example-cluster-ray-worker-www88   1/1     Terminating   0          7s
```

@Dmitri could you help take a look here?

I couldn’t replicate the problem:
I tried modifying example-full to have available_node_types.worker_node.min_workers set to 1
and running ray up with Ray 1.2 and Kubernetes 1.18.16.
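For reference, the change boils down to roughly this snippet (the rest of example-full.yaml was left untouched):

```
# The only change relative to example-full.yaml: the worker pod type's
# minimum size. The cluster was then launched with `ray up` as usual.
available_node_types:
  worker_node:
    min_workers: 1
```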

Could you tell me more about your Kubernetes environment?
Do you know if the rayproject/ray:nightly images in the example config are freshly pulled?

Sure – it’s running on VMs with 8 CPUs and 32 GB of memory; the Kubernetes installation is set up from scratch.

Are there specific aspects you’d need? The container images are also freshly pulled.

For reference, I’ve included the cluster YAML exactly as I used it:

```
cluster_name: example-cluster

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 200

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to request them to be
# created by your cluster administrator.
provider:
    type: kubernetes

    # Exposing external IP addresses for ray pods isn't currently supported.
    use_internal_ips: true

    # Namespace to use for all resources created.
    namespace: ray

    # ServiceAccount created by the autoscaler for the head node pod that it
    # runs in. If this field isn't provided, the head pod config below must
    # contain a user-created service account with the proper permissions.
    autoscaler_service_account:
        apiVersion: v1
        kind: ServiceAccount
        metadata:
            name: autoscaler

    # Role created by the autoscaler for the head node pod that it runs in.
    # If this field isn't provided, the role referenced in
    # autoscaler_role_binding must exist and have at least these permissions.
    autoscaler_role:
        kind: Role
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
            name: autoscaler
        rules:
        - apiGroups: [""]
          resources: ["pods", "pods/status", "pods/exec"]
          verbs: ["get", "watch", "list", "create", "delete", "patch"]

    # RoleBinding created by the autoscaler for the head node pod that it runs
    # in. If this field isn't provided, the head pod config below must contain
    # a user-created service account with the proper permissions.
    autoscaler_role_binding:
        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
            name: autoscaler
        subjects:
        - kind: ServiceAccount
          name: autoscaler
        roleRef:
            kind: Role
            name: autoscaler
            apiGroup: rbac.authorization.k8s.io

    services:
      # Service that maps to the head node of the Ray cluster.
      - apiVersion: v1
        kind: Service
        metadata:
            # NOTE: If you're running multiple Ray clusters with services
            # on one Kubernetes cluster, they must have unique service
            # names.
            name: example-cluster-ray-head
        spec:
            # This selector must match the head node pod's selector below.
            selector:
                component: example-cluster-ray-head
            ports:
                - name: client
                  protocol: TCP
                  port: 10001
                  targetPort: 10001
                - name: dashboard
                  protocol: TCP
                  port: 8265
                  targetPort: 8265

# Specify the pod type for the ray head node (as configured below).
head_node_type: head_node
# Specify the allowed pod types for this ray cluster and the resources they provide.
available_node_types:
  worker_node:
    # Minimum number of Ray workers of this Pod type.
    min_workers: 50
    # Maximum number of Ray workers of this Pod type. Takes precedence over min_workers.
    max_workers: 80
    # User-specified custom resources for use by Ray. Object with string keys and integer values.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    resources: {"foo": 1, "bar": 2}
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-worker-
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:nightly
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp which cause slowdowns if is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 512Mi
  head_node:
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-head-
        # Must match the head node service selector above if a head node
        # service is required.
        labels:
            component: example-cluster-ray-head
      spec:
        # Change this if you altered the autoscaler_service_account above
        # or want to provide your own.
        serviceAccountName: autoscaler

        restartPolicy: Never

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp which cause slowdowns if is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:nightly
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379  # Redis port
          - containerPort: 10001  # Used by Ray Client
          - containerPort: 8265  # Used by Ray Dashboard

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp which cause slowdowns if is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 512Mi


# Command to start ray on the head node. You don't need to change this.
# Note dashboard-host is set to 0.0.0.0 so that kubernetes can port forward.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
```

Thanks – the size of min_workers is definitely a helpful detail.
I’ll see if I can reproduce this by setting min_workers=50.
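If you retry in the meantime, it would also help to capture the autoscaler’s monitor log from the head pod while the workers come up. A rough sketch, assuming the default Ray log location inside the container and using the head pod name from your kubectl output above:

```
# Watch worker pods appear and disappear.
kubectl -n ray get pods -w

# Tail the autoscaler/monitor log on the head pod (default Ray log directory).
kubectl -n ray exec -it example-cluster-ray-head-vs46h -- \
    tail -f /tmp/ray/session_latest/logs/monitor.log
```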

I’ve been playing around with various min_workers settings and have been seeing inconsistent behavior. E.g., when I set min_workers to 10, the system indeed ended up with 10 pods, but when I changed it to 30, I ended up with only one worker or none at all.

Is there any kind of reproducible set of instructions I can follow to make the min_workers setting work with higher numbers?

This seems like it might be a basic scalability problem with the current state of Ray on K8s – I can try to look into it when I have a bit more time.

@rliaw What’s the largest number of worker pods you’ve tried out when running Ray on Kubernetes?

Thanks a lot for the quick response. If it helps track down the root cause faster, I’ll be happy to have a quick web session to show you things live.

I’ve unfortunately only tried something like 10.

Is there a target number you’re planning to support for K8s?

High (>100); this is a bug that we should be fixing soon. Is there an issue on GitHub that we can reference to track this?

I haven’t opened a GitHub issue yet (since I wasn’t sure whether it was a user error :slight_smile: ), but I’m happy to open one if you like.

BTW – as part of this exercise I think I stumbled over two more potential bugs – I’d be very interested in your take on them:

  1. When I invoke a task many times in parallel, additional task instances are often not provisioned even though there is enough free capacity available.

  2. When I annotate a task with `num_cpus=1`, the system doesn’t seem to regard the capacity available on my Kubernetes cluster as applicable for running instances of it. Only when I define a custom resource ‘CPU’ in the cluster YAML is that capacity acknowledged – and I don’t think defining a custom resource should be required for this at all (see the sketch below).
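To make point 2 concrete, here’s roughly what I’m running (a minimal sketch with a placeholder task body, executed from the head pod):

```
import ray

# Connect to the running cluster from the head pod.
ray.init(address="auto")

# The task only asks for a single CPU.
@ray.remote(num_cpus=1)
def do_work():
    return 1

# With free CPU capacity on the cluster I'd expect these to get scheduled,
# but that only seems to happen after I also declare a custom "CPU" resource
# in the cluster YAML.
print(ray.get([do_work.remote() for _ in range(20)]))
```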

WDYT?

Wow… nice – could you open up GitHub issues for each of these and tag me (richardliaw)?

Thanks!

Cool, thank you – I just created 3 issues.
