Autoscaler not shutting down idle nodes. ray 1.3

Jennice_Tao · April 28, 2021, 6:24pm

I started using ray 1.3 and noticed that the autoscaler does not honor idle_timeout_minutes in the yaml anymore. My worker nodes are still alive many hours after my task is done. They used to get terminated automatically.

asm582 · April 28, 2021, 6:35pm

Hello, I'm not sure about the root cause of this error, but try deploying the cluster with --no-config-cache; it may fix your autoscaling issue.

Ameer_Haj_Ali · April 28, 2021, 6:39pm

cc @Dmitri , this seems like a serious issue if true.

Dmitri · April 28, 2021, 7:45pm

Hi @Jennice_Tao!

Could you provide more details about your cluster configuration and Ray workload?

If you’re able to get a minimal reproduction of the problem, it would be super helpful if you could post it in a bug report issue on the Ray GitHub.

Jennice_Tao · April 28, 2021, 11:17pm

I use the following option, and my yaml didn’t change when I updated ray from 1.2 to 1.3. I use AWS. My worker nodes stay on and never get killed once created.

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 4

Jennice_Tao · April 28, 2021, 11:20pm

I ran a job hours ago and expected my worker nodes to have been terminated. Now I start my cluster again and see the nodes still alive. Here are the first lines of monitor.err and monitor.out:

less monitor.err

Warning: Permanently added '172.31.21.128' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.31.25.160' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.31.18.159' (ECDSA) to the list of known hosts.

less monitor.out
2021-04-28 23:14:49,594 INFO log_timer.py:27 -- AWSNodeProvider: Set tag ray-node-status=waiting-for-ssh on ['i-0d13e044b111', 'i-00070202355f09', 'i-0219397ef9a8d3'] [LogTimer=140ms]

Jennice_Tao · April 28, 2021, 11:39pm

I didn’t shut down the cluster (head node) between the two tasks, expecting that the autoscaler would shut down the workers. I had ~40 workers running idle unnoticed. I didn’t have any issue in ray 1.2.

Dmitri · April 29, 2021, 2:54am

I see, thank you – I’ll try to reproduce on AWS with Ray 1.3.

Dmitri · April 30, 2021, 1:34am

I’m not yet able to reproduce the problem.
Could you provide the full config yaml used to set up the cluster?

Jennice_Tao · May 2, 2021, 3:32pm

Good to know that it at least works for you. Here is the config with some private parts removed (such as the AMI ID).

# An unique identifier for the head node and workers of this cluster.

max_workers: 20

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 2.0

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 4

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2b, us-east-2a, us-east-2c
    cache_stopped_nodes: true

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: centos

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    cpu_4_ondemand:
        node_config:
            InstanceType: r5a.xlarge
        max_workers: 0
        resources: {"head": 1.}

    cpu_48_spot:
        node_config:
            InstanceType: c5.24xlarge
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: "2.0"
            CpuOptions:
                CoreCount: 48
                ThreadsPerCore: 1
        resources: {"CPU": 48, "is_spot": 1, "small_memory": 1, "object_store_memory": 141807662592, "worker": 1.}
        max_workers: 40
        worker_setup_commands:
            - pip3.7 install -U --user ray==1.3.0
            - sudo mount -t tmpfs -o size=133G tmpfs /dev/shm

    large_memory_cpu_48_spot:
        node_config:
            InstanceType: m5.24xlarge
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: "2.0"
            CpuOptions:
                CoreCount: 48
                ThreadsPerCore: 1
        resources: {"CPU": 48, "is_spot": 1, "large_memory": 1, "object_store_memory": 347966092800, "worker": 1.}
        max_workers: 40
        worker_setup_commands:
            - pip3.7 install -U --user ray==1.3.0
            - sudo mount -t tmpfs -o size=325G tmpfs /dev/shm

head_node_type: cpu_4_ondemand

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts:
{
…
}

file_mounts_sync_continuously: false

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip3.7 install -U --user ray==1.3.0

# Command to start ray on the head node. You don’t need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; RAY_BACKEND_LOG_LEVEL=error ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    #- source /home/centos/bts/env/bts_setenv.rc
    - ulimit -n 65535; RAY_BACKEND_LOG_LEVEL=error ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Dmitri · May 2, 2021, 8:08pm

Thanks for the extra details.

I tried launching with a similar config and scaling up with some dummy tasks – I’m still not able to reproduce the issue, unfortunately. Scale down happens as expected after the idle period…

If possible could you post log output from
/tmp/ray/session_latest/logs/monitor.*
Also, could you share some details about the workload you are running?

So far, it seems to me that the auto-downscaling itself is functioning correctly. It’s possible something in the Ray internals is preventing the nodes in your cluster from being correctly marked idle.
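For anyone gathering the requested monitor logs, here is a minimal sketch that prints the tail of every file matching the path mentioned above. The helper name `tail_logs` and the 50-line default are illustrative choices, not part of Ray; only the log path comes from this thread.

```python
import glob
from collections import deque


def tail_logs(pattern, n=50):
    """Return {path: last n lines} for every file matching the glob pattern."""
    tails = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as f:
            # A bounded deque keeps only the final n lines of the file.
            tails[path] = list(deque(f, maxlen=n))
    return tails


# Run on the head node; prints nothing if no monitor logs exist at this path.
for path, lines in tail_logs("/tmp/ray/session_latest/logs/monitor.*").items():
    print(f"==> {path} <==")
    print("".join(lines), end="")
```

The output can be pasted into a reply or attached to a GitHub issue.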

Jennice_Tao · May 29, 2021, 2:18pm

Hi,

This is still happening to me, but it seems to happen only when my jobs fail due to exceptions or errors.

1) I create placement groups. 2) I run a task on each placement group. 3) The tasks raise exceptions. 4) A large cluster is left idle.

I still have no problems with 1.2, though. It only happens in 1.3.

Thank you!

Jennice_Tao · May 30, 2021, 4:32pm

When everything finishes correctly, the autoscaler does scale down.
If there is a failed task, the autoscaler does not work.

Dmitri · May 30, 2021, 5:55pm

Ah, thanks! Involvement of placement groups and task failures are helpful details.
I wonder what kind of exception is raised. For example, do you see the problem if one of the tasks internally raises a Python exception:

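@ray.remote(placement_group=pg)
def fail():
    # Raise an explicit exception; a bare `raise` has nothing to re-raise here.
    raise Exception("intentional failure")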

Could you provide some details on how you’re configuring the placement groups (e.g. placement group strategy, bundles requested), how the tasks are defined, and how the Ray program is deployed?

A detailed minimal reproduction of the problem would be super helpful.
(I know such a repro takes some time and effort to construct…)

@sangcho might have some general insight into stuff related to placement groups.
@Alex might know about autoscaler interaction with placement groups.

sangcho · May 31, 2021, 2:57am

Yeah, it looks like (1) a repro and (2) logs from the monitor (and maybe ray status, which prints the monitor logs) could help us figure out what is blocking the autoscaler from scaling down.

@Dmitri as a feature request, we can probably also log why nodes are up and running in ray. That sort of observability will help us investigate this type of issue.

mataney · June 7, 2021, 8:14am

I have the same problem, also using ray 1.3.0.
Trying to reproduce.

mataney · June 7, 2021, 11:36am

Here are the config.yaml and code that reproduce this problem.
What do you think is the reason for this? What can be done?

cluster_name: matan
max_workers: 10
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: ai2-israel
auth:
    ssh_user: ray
available_node_types:
    head_node:
        min_workers: 0
        max_workers: 1
        resources: {"CPU": 1}
        node_config:
            machineType: n1-highmem-2
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu-debian-9
    worker_node:
        min_workers: 0
        resources: {"CPU": 2}
        node_config:
            machineType: n1-highmem-2
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu-debian-9
            scheduling:
              - preemptible: false
head_node_type: head_node

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

import logging
from time import sleep

import ray
import ray.autoscaler.sdk

logger = logging.getLogger(__file__)


# Actor that restarts and retries indefinitely; each call sleeps ~11 seconds.
@ray.remote(num_cpus=1.0, max_restarts=-1, max_task_retries=-1)
class Reproduce(object):
    def run(self):
        for i in range(10, -1, -1):
            sleep(1)
        return 4


ray.init(address="auto")
# Ask the autoscaler for 10 CPUs; this request is never overridden below.
ray.autoscaler.sdk.request_resources(num_cpus=10)

arr = []
for _ in range(40):
    c = Reproduce.remote()
    arr.append(c.run.remote())

print(ray.get(arr))

Dmitri · June 7, 2021, 2:46pm

In the case described in the last message, not downscaling is expected.

The request for 10 CPUs persists until it is overridden by a request for 0 CPUs. (I also find this confusing.)
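As a rough illustration of why the workers stay up, here is a toy estimate using the numbers from the config above (head node with 1 CPU, workers with 2 CPUs each). This is a simplified model, not Ray's actual scheduling logic; the function name is made up, and it assumes the head node's CPU counts toward the request.

```python
import math


def workers_kept_alive(requested_cpus, head_cpus, cpus_per_worker):
    """Toy estimate of how many workers a standing CPU request keeps running."""
    # Assumption: the head node's CPUs count toward satisfying the request.
    remaining = max(0, requested_cpus - head_cpus)
    return math.ceil(remaining / cpus_per_worker)


print(workers_kept_alive(requested_cpus=10, head_cpus=1, cpus_per_worker=2))  # 5
print(workers_kept_alive(requested_cpus=0, head_cpus=1, cpus_per_worker=2))   # 0
```

Under this model, the standing request for 10 CPUs pins about five workers regardless of idle_timeout_minutes, and a request for 0 CPUs pins none.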

mataney · June 8, 2021, 6:44am

Are you saying that because I explicitly wrote

ray.autoscaler.sdk.request_resources(num_cpus=10)

I should do something like the following?

ray.autoscaler.sdk.request_resources(num_cpus=0)

Dmitri · June 8, 2021, 6:10pm

Indeed. The Autoscaler SDK documentation (Ray v2.0.0.dev0) notes:
"This request is persistent until another call to request_resources() is made to override."

The effect is similar to setting min_workers in a cluster config.
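One way to avoid leaving a standing request behind, sketched under the assumption that resetting to 0 CPUs is the desired cleanup: wrap the request in a context manager so the override runs even if the job raises. The names `scoped_resource_request` and `fake_request_resources` are hypothetical; the stand-in only records calls so the sketch runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def scoped_resource_request(request_fn, num_cpus):
    """Hold an autoscaler resource request only for the duration of the block."""
    request_fn(num_cpus=num_cpus)
    try:
        yield
    finally:
        # Override the persistent request so idle nodes can time out again.
        request_fn(num_cpus=0)


# Stand-in for the real SDK call; it just records what was requested.
calls = []


def fake_request_resources(num_cpus=None, bundles=None):
    calls.append(num_cpus)


try:
    with scoped_resource_request(fake_request_resources, num_cpus=10):
        raise RuntimeError("simulated task failure")
except RuntimeError:
    pass

print(calls)  # [10, 0]
```

On a real cluster, pass ray.autoscaler.sdk.request_resources as request_fn after calling ray.init(address="auto").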
