Autoscaler scales the cluster up and down all the time

Hi,

we are using PPO to train in a custom env on a local cluster consisting of 4 machines managed by Kubernetes. After updating to Ray 1.3 we observed peculiar behavior in which the cluster scales up and down most of the time (throughout the whole training, not only in the start-up phase). In our setup, PPO is scaled to 180 workers, where each worker consumes 1 CPU (a simplified sketch of how we launch the training is included below the log excerpt). The scaling up & down process looks like this:

(…)
(autoscaler +1m0s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m6s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m12s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m18s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m23s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m29s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m34s) Resized to 187 CPUs, 4 GPUs.
(autoscaler +1m45s) Resized to 63 CPUs, 2 GPUs.
(…)
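
For context, here is roughly how we launch the training (simplified sketch; the env and most config values here are placeholders, but the worker and CPU settings match what I described above):

import ray
from ray import tune

# Connect to the existing cluster started via the autoscaler config below.
ray.init(address="auto")

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",        # stand-in here; we actually use our custom env
        "num_workers": 180,          # 180 rollout workers spread over the worker pods
        "num_cpus_per_worker": 1,    # each rollout worker requests exactly 1 CPU
        "num_gpus": 1,               # GPU for the trainer process (placeholder)
        "framework": "torch",        # placeholder
    },
)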

Do you have any clue where this behavior may come from? Is this some misconfiguration of our training run, some cluster / hardware issue, or maybe an issue in the code? If any additional information is required, I will be more than happy to provide it.

Thanks in advance for suggestions,
Mateusz

Hello, please check your idle timeout param (idle_timeout_minutes) in the cluster config, which hints to the autoscaler when to take down nodes. I'm not sure what your previous Ray version was, but if I understand correctly the autoscaler algorithm changed; here is a link to the new autoscaler algorithm: A Glimpse into the Ray Autoscaler by Ameer Haj Ali - YouTube

Yeah, this may be an autoscaler bug.

@Ameer_Haj_Ali @Dmitri can you take a look at this?

I have checked that: idle_timeout_minutes is set to 5 minutes, so as you can see from the logs (nodes are taken down within seconds of being added), this is not the expected behavior. Regarding the previous version, we were using release 1.2 before.

Could you share the autoscaling config yaml being used?

I guess you mean this file; here it is:

cluster_name: lpcc
max_workers: 3
upscaling_speed: 1
idle_timeout_minutes: 5
provider:
  type: kubernetes
  use_internal_ips: true
  namespace: ray
  autoscaler_service_account:
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: autoscaler
      namespace: ray
  autoscaler_role:
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: autoscaler
      namespace: ray
    rules:
      - apiGroups:
          - ''
        resources:
          - pods
          - pods/status
          - pods/exec
        verbs:
          - get
          - watch
          - list
          - create
          - delete
          - patch
  autoscaler_role_binding:
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: autoscaler
      namespace: ray
    subjects:
      - kind: ServiceAccount
        name: autoscaler
        namespace: ray
    roleRef:
      kind: Role
      name: autoscaler
      apiGroup: rbac.authorization.k8s.io
  services:
    - apiVersion: v1
      kind: Service
      metadata:
        name: ray-head
        namespace: ray
      spec:
        selector:
          component: ray-head
        ports:
          - name: client
            protocol: TCP
            port: 10001
            targetPort: 10001
          - name: dashboard
            protocol: TCP
            port: 8265
            targetPort: 8265
          - name: email
            protocol: TCP
            port: 465
            targetPort: 465
head_node_type: head_node
available_node_types:
  head_node:
    max_workers: 0
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-head-
        labels:
          component: ray-head
      spec:
        serviceAccountName: autoscaler
        restartPolicy: Never
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ray-ssh
            secret:
              secretName: ray-ssh
              defaultMode: 256
          - name: ray-results
            nfs:
              path: /srv/nfs/ray
              server: lpcc-walesa
        containers:
          - name: ray-node
            imagePullPolicy: Always
            image: 'lpcc-walesa:32500/ray-autoscaler:1.3.0'
            command:
              - /bin/bash
              - '-c'
              - '--'
            args:
              - 'trap : TERM INT; sleep infinity & wait;'
            ports:
              - containerPort: 6379
              - containerPort: 10001
              - containerPort: 8265
              - containerPort: 465
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /etc/ray-ssh
                name: ray-ssh
              - mountPath: /srv/nfs/ray
                name: ray-results
            resources:
              requests:
                cpu: 1000m
                memory: 16Gi
                nvidia.com/gpu: 1
              limits:
                cpu: 1000m
                memory: 16Gi
                nvidia.com/gpu: 1
            env:
              - name: MY_CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    resource: requests.cpu
              - name: MY_MEMORY_REQUEST
                valueFrom:
                  resourceFieldRef:
                    resource: requests.memory
              - name: http_proxy
                value: 'http://10.159.17.111:3128/'
              - name: https_proxy
                value: 'http://10.159.17.111:3128/'
              - name: no_proxy
                value: >-
                  localhost,127.0.0.1,lpcc-walesa,lpcc-kosciuszko,lpcc-pilsudski,lpcc-maria
    resources:
      CPU: 1
      GPU: 1
      memory: 12025908428
  worker_node:
    min_workers: 3
    max_workers: 3
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-
        labels:
          component: ray-worker
      spec:
        restartPolicy: Never
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ray-ssh
            secret:
              secretName: ray-ssh
              defaultMode: 256
          - name: ray-results
            nfs:
              path: /srv/nfs/ray
              server: lpcc-walesa
        containers:
          - name: ray-node
            imagePullPolicy: Always
            image: 'lpcc-walesa:32500/ray-autoscaler:1.3.0'
            command:
              - /bin/bash
              - '-c'
              - '--'
            args:
              - 'trap : TERM INT; sleep infinity & wait;'
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /etc/ray-ssh
                name: ray-ssh
              - mountPath: /srv/nfs/ray
                name: ray-results
            resources:
              requests:
                cpu: 62000m
                memory: 80Gi
              limits:
                cpu: 62000m
                memory: 80Gi
            env:
              - name: http_proxy
                value: 'http://10.159.17.111:3128/'
              - name: https_proxy
                value: 'http://10.159.17.111:3128/'
              - name: no_proxy
                value: >-
                  localhost,127.0.0.1,lpcc-walesa,lpcc-kosciuszko,lpcc-pilsudski,lpcc-maria
    resources:
      CPU: 62
      memory: 60129542144
head_start_ray_commands:
  - ray stop
  - >-
    ulimit -n 65536; ray start --head
    --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0
worker_start_ray_commands:
  - ray stop
  - 'ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379'
file_mounts:
  /tmp/refs_to_sync: /tmp/refs_to_sync
cluster_synced_files: []
file_mounts_sync_continuously: false
initialization_commands:
  - sudo cp /etc/ray-ssh/ray $HOME/.ssh/ray
  - 'sudo chown 1000:100 $HOME/.ssh/ray'
setup_commands:
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  - >-
    cd $HOME/aipilot && git submodule update && git fetch && git reset --hard &&
    git checkout `cat /tmp/refs_to_sync`
  - cd $HOME/aipilot/scripts && ./get_simulator.sh
  - cd $HOME/aipilot && poetry build
  - >-
    cd $HOME/aipilot/dist && pip install --extra-index-url
    http://pypi.aptiv.today:5873/simple/ --extra-index-url
    http://pypi.aptiv.today:3141/aicore/prod/+simple/ --trusted-host
    pypi.aptiv.today aipilot-*.tar.gz
head_setup_commands:
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  - >-
    cd $HOME/aipilot && git submodule update && git fetch && git reset --hard &&
    git checkout `cat /tmp/refs_to_sync`
  - cd $HOME/aipilot/scripts && ./get_simulator.sh
  - cd $HOME/aipilot && poetry build
  - >-
    cd $HOME/aipilot/dist && pip install --extra-index-url
    http://pypi.aptiv.today:5873/simple/ --extra-index-url
    http://pypi.aptiv.today:3141/aicore/prod/+simple/ --trusted-host
    pypi.aptiv.today aipilot-*.tar.gz
worker_setup_commands:
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  - >-
    cd $HOME/aipilot && git submodule update && git fetch && git reset --hard &&
    git checkout `cat /tmp/refs_to_sync`
  - cd $HOME/aipilot/scripts && ./get_simulator.sh
  - cd $HOME/aipilot && poetry build
  - >-
    cd $HOME/aipilot/dist && pip install --extra-index-url
    http://pypi.aptiv.today:5873/simple/ --extra-index-url
    http://pypi.aptiv.today:3141/aicore/prod/+simple/ --trusted-host
    pypi.aptiv.today aipilot-*.tar.gz
head_node: {}
worker_nodes: {}
initial_workers: 3
autoscaling_mode: default
target_utilization_fraction: 0.8
auth: {}
no_restart: false

Thanks for sharing the config!
I’m curious whether the problem persists when the head node’s Ray version is upgraded to the latest (as of writing) nightly version.

If upgrading the Ray version doesn’t cause problems for your workflow, that can be achieved by prepending the following to head setup commands:
pip uninstall ray -y && pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/052d2acaee84b5bee8fd772d9d98dd56677d1533/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
That’s assuming you’re using a Python 3.7 image. Substitute the appropriate Python version if using another image.
(Installing Ray — Ray v2.0.0.dev0)
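
In your config that would look something like this (sketch; only the first entry is new, the rest of your head_setup_commands stay as they are):

head_setup_commands:
  - >-
    pip uninstall ray -y && pip install
    https://s3-us-west-2.amazonaws.com/ray-wheels/master/052d2acaee84b5bee8fd772d9d98dd56677d1533/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  # ... remaining head_setup_commands unchanged ...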

It could also be helpful to see the autoscaling monitor logs – these are /tmp/ray/session_latest/logs/monitor* in the head pod.
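
For example (assuming the ray namespace from your config; substitute your actual head pod name):

kubectl -n ray exec <ray-head-pod> -- tail -n 200 /tmp/ray/session_latest/logs/monitor.log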