Autoscaler not shutting down idle nodes. ray 1.3

@Ameer_Haj_Ali

I started using ray 1.3, and notice that autoscaler does not honor idle_timeout_minutes in the yaml anymore. My worker nodes are alive after many hours after the my task is done. It used to get terminated automatically.

Hello, not sure about the root cause of this error but try deploying cluster with --no-config-cache it may fix your autoscaling issue

cc @Dmitri , this seems like a serious issue if true.

Hi @Jennice_Tao!

Could you provide more details about your cluster configuration and Ray workload?

If you’re able to get a minimal reproduction of the problem, it would be super helpful if you could post that in a bug report issue on the Ray github.

I use the following option and my yaml didn’t change while I updated ray from 1.2 to 1.3. I use AWS. My worker nodes are always on and not get killed once created.

If a node is idle for this many minutes, it will be removed.

idle_timeout_minutes: 4

I ran a job hours ago, and was expected my worker nodes were terminated. Now I start my cluster again, and see the nodes alive. The first lines of monitor.err and monitor.out

less monitor.err

Warning: Permanently added '172.31.21.128' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.31.25.160' (ECDSA) to the list of known hosts.
Warning: Permanently added '172.31.18.159' (ECDSA) to the list of known hosts.
less monitor.out
2021-04-28 23:14:49,594 INFO log_timer.py:27 -- AWSNodeProvider: Set tag ray-node-status=waiting-for-ssh on ['i-0d13e044b111', 'i-00070202355f09', 'i-0219397ef9a8d3'] [LogTimer=140ms]

I didn’t shutdown the cluster (head node) in between two tasks, expecting that the autoscaler would shut down workers. I had ~40 workers running idle unnoticed. I didn’t have any issue in ray 1.2

I see, thank you – I’ll try to reproduce on AWS with Ray 1.3.

I’m not yet able to reproduce the problem.
Could you provide the full config yaml used to set up the cluster?

good to know that it at least works for you. Here is the config with some private parts removed (like ami id and etc).

An unique identifier for the head node and workers of this cluster.

max_workers: 20

The autoscaler will scale up the cluster faster with higher upscaling speed.

E.g., if the task requires adding more nodes then autoscaler will gradually

scale up the cluster in chunks of upscaling_speed*currently_running_nodes.

This number should be > 0.

upscaling_speed: 2.0

If a node is idle for this many minutes, it will be removed.

idle_timeout_minutes: 4

Cloud-provider specific configuration.

provider:
type: aws
region: us-east-2
availability_zone: us-east-2b, us-east-2a, us-east-2c
cache_stopped_nodes: true

How Ray will authenticate with newly launched nodes.

auth:
ssh_user: centos

Tell the autoscaler the allowed node types and the resources they provide.

The key is the name of the node type, which is just for debugging purposes.

The node config specifies the launch config and physical instance type.

available_node_types:
cpu_4_ondemand:
node_config:
InstanceType: r5a.xlarge
max_workers: 0

resources: {“head”: 1.}

cpu_48_spot:
node_config:
InstanceType: c5.24xlarge
InstanceMarketOptions:
MarketType: spot
SpotOptions:
MaxPrice: “2.0”
CpuOptions:
CoreCount: 48
ThreadsPerCore: 1
resources: {“CPU”: 48, “is_spot”: 1, “small_memory”: 1, “object_store_memory”: 141807662592, “worker”: 1.}
max_workers: 40
worker_setup_commands:

  • pip3.7 install -U --user ray==1.3.0
  • sudo mount -t tmpfs -o size=133G tmpfs /dev/shm

large_memory_cpu_48_spot:
node_config:
InstanceType: m5.24xlarge
InstanceMarketOptions:
MarketType: spot
SpotOptions:
MaxPrice: “2.0”
CpuOptions:
CoreCount: 48
ThreadsPerCore: 1
resources: { “CPU”: 48, “is_spot”: 1, “large_memory”: 1, “object_store_memory”: 347966092800, “worker”: 1.}
max_workers: 40
worker_setup_commands:

  • pip3.7 install -U --user ray==1.3.0
  • sudo mount -t tmpfs -o size=325G tmpfs /dev/shm

head_node_type: cpu_4_ondemand

Files or directories to copy to the head and worker nodes. The format is a

dictionary from REMOTE_PATH: LOCAL_PATH, e.g.

file_mounts:
{

}

file_mounts_sync_continuously: false

List of commands that will be run before setup_commands. If docker is

enabled, these commands will run outside the container and before docker

is setup.

initialization_commands: []

Custom commands that will be run on the head node after common setup.

head_setup_commands:

  • pip3.7 install -U --user ray==1.3.0

Command to start ray on the head node. You don’t need to change this.

head_start_ray_commands:

  • ray stop
  • ulimit -n 65536; RAY_BACKEND_LOG_LEVEL=error ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:

  • ray stop
    #- source /home/centos/bts/env/bts_setenv.rc
  • ulimit -n 65535; RAY_BACKEND_LOG_LEVEL=error ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Thanks for the extra details.

I tried launching with a similar config and scaling up with some dummy tasks – I’m still not able to reproduce the issue, unfortunately. Scale down happens as expected after the idle period…

If possible could you post log output from
/tmp/ray/session_latest/logs/monitor.*
Also, could you share some details about the workload you are running?

So far, it seems to me that the auto-downscaling itself is functioning correctly. It’s possible something in the Ray internals is preventing the nodes in your cluster from being correctly marked idle.

Hi,

This is still happening to me, but it seems to only happen when my jobs fail due to exceptions or errors.

  1. I create placement groups. 2) run a task on each placement group. 3) the tasks raise exceptions. 4) a large cluster left idle.

I still have no problems with 1.2 though. only happens in 1.3

Thank you!

When everything finishes correctly, the autoscaler does scale down.
If there is a failed task, the autoscaler does not work.

Ah, thanks! Involvement of placement groups and task failures are helpful details.
I wonder what kind of exception is raised. For example, do you see the problem if one of the tasks internally raises a Python exception:

@ray.remote(placement_group=pg)
def fail():
    raise

Could you provide some details on how you’re configuring the placement groups (e.g. placement group strategy, bundles requested), how the tasks are defined, and how the Ray program is deployed?

A detailed minimal reproduction of the problem would be super helpful.
(I know such a repro takes some time and effort to construct…)

@sangcho might have some general insight into stuff related to placement groups.
@Alex might know about autoscaler interaction with placement groups.

Yeah it looks like

  1. repro
  2. logs from monitor (and maybe ray status, which is printing monitor logs)

could help us figuring out what blocks the autoscaler to be scaled down.

@Dmitri as a feature request, we can probably also log why nodes are up and running in ray. That sort of observability will help us investigating this type of issues.

1 Like

I have the same problem, and as well, using ray 1.3.0.
Trying to reproduce.

Here are the config.yaml and code that reproduce this problem.
What do you think is the reason for this? What can be done?

cluster_name: matan
max_workers: 10
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: ai2-israel
auth:
    ssh_user: ray
available_node_types:
    head_node:
        min_workers: 0
        max_workers: 1
        resources: {"CPU": 1}
        node_config:
            machineType: n1-highmem-2
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu-debian-9
    worker_node:
        min_workers: 0
        resources: {"CPU": 2}
        node_config:
            machineType: n1-highmem-2
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu-debian-9
            scheduling:
              - preemptible: false
head_node_type: head_node

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
import ray
import logging
from time import sleep
import ray
import ray.autoscaler.sdk

logger = logging.getLogger(__file__)

@ray.remote(num_cpus=1.0, max_restarts=-1, max_task_retries=-1)
class Reproduce(object):
    def run(self):
        for i in range(10, -1, -1):
            sleep(1)
        return 4

ray.init(address="auto")
ray.autoscaler.sdk.request_resources(num_cpus=10)

arr = []
for _ in range(40):
    c = Reproduce.remote()
    arr.append(c.run.remote())

print(ray.get(arr))

In the case described in the last message, not downscaling is expected.

The request for 10 cpus becomes permanent until overridden by a request for 0 cpus. (I also find this confusing.)

Are you saying that because I explicitly wrote

ray.autoscaler.sdk.request_resources(num_cpus=10)

I should do something like

ray.autoscaler.sdk.request_resources(num_cpus=0)

Indeed: Autoscaler SDK — Ray v2.0.0.dev0
Note:
"This request is persistent until another call to request_resources() is made to override."

The effect is similar to setting min_workers in a cluster config.