Autoscaler scales the cluster up and down all the time

Hi,

we are using PPO to train in a custom env on a local cluster consisting of 4 machines managed by Kubernetes. After updating to Ray 1.3 we observed peculiar behavior in which the cluster scales up and down most of the time (throughout the whole training, not only in the start-up phase). In our setup, PPO is scaled to 180 workers, where each worker consumes 1 CPU (a simplified sketch of how we launch the training is included below the log excerpt). The scaling up & down process looks like this:

(…)
(autoscaler +1m0s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m6s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m12s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m18s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m23s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m29s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m34s) Resized to 187 CPUs, 4 GPUs.
(autoscaler +1m45s) Resized to 63 CPUs, 2 GPUs.
(…)
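
For context, here is roughly how we launch the training (simplified sketch; the env and most config values here are placeholders, but the worker and CPU settings match what I described above):

import ray
from ray import tune

# Connect to the existing cluster started via the autoscaler config below.
ray.init(address="auto")

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",        # stand-in here; we actually use our custom env
        "num_workers": 180,          # 180 rollout workers spread over the worker pods
        "num_cpus_per_worker": 1,    # each rollout worker requests exactly 1 CPU
        "num_gpus": 1,               # GPU for the trainer process (placeholder)
        "framework": "torch",        # placeholder
    },
)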

Do you have any clue where this behavior may come from? Is this some misconfiguration of our training run, some cluster / hardware issue, or maybe an issue in the code? If any additional information is required, I will be more than happy to provide it.

Thanks in advance for suggestions,
Mateusz

Hello, please check your idle timeout param (idle_timeout_minutes) in the cluster config, which hints to the autoscaler when to take down nodes. I'm not sure what your previous Ray version was, but if I understand correctly the autoscaler algorithm changed; here is a link to the new autoscaler algorithm: A Glimpse into the Ray Autoscaler by Ameer Haj Ali - YouTube

Yeah, this may be an autoscaler bug.

@Ameer_Haj_Ali @Dmitri can you take a look at this?

I have checked that: idle_timeout_minutes is set to 5 minutes, so as you can see from the logs (nodes are taken down within seconds of being added), this is not the expected behavior. Regarding the previous version, we were using release 1.2 before.

Could you share the autoscaling config yaml being used?

I guess you mean this file; here it is:

cluster_name: lpcc
max_workers: 3
upscaling_speed: 1
idle_timeout_minutes: 5
provider:
  type: kubernetes
  use_internal_ips: true
  namespace: ray
  autoscaler_service_account:
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: autoscaler
      namespace: ray
  autoscaler_role:
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: autoscaler
      namespace: ray
    rules:
      - apiGroups:
          - ''
        resources:
          - pods
          - pods/status
          - pods/exec
        verbs:
          - get
          - watch
          - list
          - create
          - delete
          - patch
  autoscaler_role_binding:
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: autoscaler
      namespace: ray
    subjects:
      - kind: ServiceAccount
        name: autoscaler
        namespace: ray
    roleRef:
      kind: Role
      name: autoscaler
      apiGroup: rbac.authorization.k8s.io
  services:
    - apiVersion: v1
      kind: Service
      metadata:
        name: ray-head
        namespace: ray
      spec:
        selector:
          component: ray-head
        ports:
          - name: client
            protocol: TCP
            port: 10001
            targetPort: 10001
          - name: dashboard
            protocol: TCP
            port: 8265
            targetPort: 8265
          - name: email
            protocol: TCP
            port: 465
            targetPort: 465
head_node_type: head_node
available_node_types:
  head_node:
    max_workers: 0
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-head-
        labels:
          component: ray-head
      spec:
        serviceAccountName: autoscaler
        restartPolicy: Never
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ray-ssh
            secret:
              secretName: ray-ssh
              defaultMode: 256
          - name: ray-results
            nfs:
              path: /srv/nfs/ray
              server: lpcc-walesa
        containers:
          - name: ray-node
            imagePullPolicy: Always
            image: 'lpcc-walesa:32500/ray-autoscaler:1.3.0'
            command:
              - /bin/bash
              - '-c'
              - '--'
            args:
              - 'trap : TERM INT; sleep infinity & wait;'
            ports:
              - containerPort: 6379
              - containerPort: 10001
              - containerPort: 8265
              - containerPort: 465
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /etc/ray-ssh
                name: ray-ssh
              - mountPath: /srv/nfs/ray
                name: ray-results
            resources:
              requests:
                cpu: 1000m
                memory: 16Gi
                nvidia.com/gpu: 1
              limits:
                cpu: 1000m
                memory: 16Gi
                nvidia.com/gpu: 1
            env:
              - name: MY_CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    resource: requests.cpu
              - name: MY_MEMORY_REQUEST
                valueFrom:
                  resourceFieldRef:
                    resource: requests.memory
              - name: http_proxy
                value: 'http://10.159.17.111:3128/'
              - name: https_proxy
                value: 'http://10.159.17.111:3128/'
              - name: no_proxy
                value: >-
                  localhost,127.0.0.1,lpcc-walesa,lpcc-kosciuszko,lpcc-pilsudski,lpcc-maria
    resources:
      CPU: 1
      GPU: 1
      memory: 12025908428
  worker_node:
    min_workers: 3
    max_workers: 3
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-
        labels:
          component: ray-worker
      spec:
        restartPolicy: Never
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ray-ssh
            secret:
              secretName: ray-ssh
              defaultMode: 256
          - name: ray-results
            nfs:
              path: /srv/nfs/ray
              server: lpcc-walesa
        containers:
          - name: ray-node
            imagePullPolicy: Always
            image: 'lpcc-walesa:32500/ray-autoscaler:1.3.0'
            command:
              - /bin/bash
              - '-c'
              - '--'
            args:
              - 'trap : TERM INT; sleep infinity & wait;'
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /etc/ray-ssh
                name: ray-ssh
              - mountPath: /srv/nfs/ray
                name: ray-results
            resources:
              requests:
                cpu: 62000m
                memory: 80Gi
              limits:
                cpu: 62000m
                memory: 80Gi
            env:
              - name: http_proxy
                value: 'http://10.159.17.111:3128/'
              - name: https_proxy
                value: 'http://10.159.17.111:3128/'
              - name: no_proxy
                value: >-
                  localhost,127.0.0.1,lpcc-walesa,lpcc-kosciuszko,lpcc-pilsudski,lpcc-maria
    resources:
      CPU: 62
      memory: 60129542144
head_start_ray_commands:
  - ray stop
  - >-
    ulimit -n 65536; ray start --head
    --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0
worker_start_ray_commands:
  - ray stop
  - 'ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379'
file_mounts:
  /tmp/refs_to_sync: /tmp/refs_to_sync
cluster_synced_files: []
file_mounts_sync_continuously: false
initialization_commands:
  - sudo cp /etc/ray-ssh/ray $HOME/.ssh/ray
  - 'sudo chown 1000:100 $HOME/.ssh/ray'
setup_commands:
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  - >-
    cd $HOME/aipilot && git submodule update && git fetch && git reset --hard &&
    git checkout `cat /tmp/refs_to_sync`
  - cd $HOME/aipilot/scripts && ./get_simulator.sh
  - cd $HOME/aipilot && poetry build
  - >-
    cd $HOME/aipilot/dist && pip install --extra-index-url
    http://pypi.aptiv.today:5873/simple/ --extra-index-url
    http://pypi.aptiv.today:3141/aicore/prod/+simple/ --trusted-host
    pypi.aptiv.today aipilot-*.tar.gz
head_setup_commands:
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  - >-
    cd $HOME/aipilot && git submodule update && git fetch && git reset --hard &&
    git checkout `cat /tmp/refs_to_sync`
  - cd $HOME/aipilot/scripts && ./get_simulator.sh
  - cd $HOME/aipilot && poetry build
  - >-
    cd $HOME/aipilot/dist && pip install --extra-index-url
    http://pypi.aptiv.today:5873/simple/ --extra-index-url
    http://pypi.aptiv.today:3141/aicore/prod/+simple/ --trusted-host
    pypi.aptiv.today aipilot-*.tar.gz
worker_setup_commands:
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  - >-
    cd $HOME/aipilot && git submodule update && git fetch && git reset --hard &&
    git checkout `cat /tmp/refs_to_sync`
  - cd $HOME/aipilot/scripts && ./get_simulator.sh
  - cd $HOME/aipilot && poetry build
  - >-
    cd $HOME/aipilot/dist && pip install --extra-index-url
    http://pypi.aptiv.today:5873/simple/ --extra-index-url
    http://pypi.aptiv.today:3141/aicore/prod/+simple/ --trusted-host
    pypi.aptiv.today aipilot-*.tar.gz
head_node: {}
worker_nodes: {}
initial_workers: 3
autoscaling_mode: default
target_utilization_fraction: 0.8
auth: {}
no_restart: false

Thanks for sharing the config!
I’m curious whether the problem persists when the head node’s Ray version is upgraded to the latest (as of writing) nightly version.

If upgrading the Ray version doesn’t cause problems for your workflow, that can be achieved by prepending the following to head setup commands:
pip uninstall ray -y && pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/052d2acaee84b5bee8fd772d9d98dd56677d1533/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
That’s assuming you’re using a Python 3.7 image. Substitute the appropriate Python version if using another image.
(Installing Ray — Ray v2.0.0.dev0)
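
In your config that would look something like this (sketch; only the first entry is new, the rest of your head_setup_commands stay as they are):

head_setup_commands:
  - >-
    pip uninstall ray -y && pip install
    https://s3-us-west-2.amazonaws.com/ray-wheels/master/052d2acaee84b5bee8fd772d9d98dd56677d1533/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
  - >-
    cd $HOME && (test -e aipilot || git clone --recurse-submodules
    git@hpc-gitlab.aptiv.com:aipilot/aipilot.git)
  # ... remaining head_setup_commands unchanged ...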

It could also be helpful to see the autoscaling monitor logs – these are /tmp/ray/session_latest/logs/monitor* in the head pod.
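
For example (assuming the ray namespace from your config; substitute your actual head pod name):

kubectl -n ray exec <ray-head-pod> -- tail -n 200 /tmp/ray/session_latest/logs/monitor.log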