How to disable Autoscaler for local cluster

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

Hi,
I am manually setting up the cluster by running ray start on 4 nodes, and all 4 nodes were removed after one node became idle. Here is the log:

2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 3750ee9c62631c3efbb48b2a9c52471e7d1f3f36425e922eb2d22a20.
2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 3063e53fed3ac900fbbdd2df89936d6df5eaaf6b1aa424a52c16d44e.
2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: 0bd1670240260db79954c8c8d8f15a7fba6b2825be123f9726a17c4c.
2023-03-07 06:00:15,234 INFO load_metrics.py:163 -- LoadMetrics: Removed ip: bbd989be7dfe24ad482333fe669d94c30612629925b8eb174a354bea.
2023-03-07 06:00:15,234 INFO load_metrics.py:169 -- LoadMetrics: Removed 4 stale ip mappings: {'3750ee9c62631c3efbb48b2a9c52471e7d1f3f36425e922eb2d22a20', '3063e53fed3ac900fbbdd2df89936d6df5eaaf6b1aa424a52c16d44e', '0bd1670240260db79954c8c8d8f15a7fba6b2825be123f9726a17c4c', 'bbd989be7dfe24ad482333fe669d94c30612629925b8eb174a354bea'} not in set()

This leads to two questions:

  1. How can I prevent nodes from becoming idle and being removed from the cluster; or
  2. Is there any configuration I could set, or any guidance, for building a non-autoscaling cluster (mentioned in the Configuring Autoscaling — Ray 2.3.0 docs, but without instructions)?

Thanks!

Hi @JustinY, I would assume setting min_workers = max_workers = 0 would achieve the behavior you want.
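
For context, these are top-level fields in the cluster YAML; a minimal sketch with the values suggested above (treat the exact numbers as placeholders for your own setup):

# Pin the worker count so the autoscaler neither adds nor removes nodes
min_workers: 0
max_workers: 0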

@JustinY Did @Lars_Simon_Zehnder’s suggestion work?

cc: @Alex @Chen_Shen Any ideas on what config can be used to mitigate the problem?

+1 to Jules’ question. In particular, I’m also worried that the cause and effect may be flipped here. The logs you’ve shared indicate that the autoscaler has acknowledged that the nodes disappeared (they don’t mean the autoscaler removed them).

Admittedly, those logs are potentially confusing and should be at debug level instead.

I just tried setting idle_timeout_minutes: 999999 in the YAML config and it did not work: all 4 nodes became idle after being inactive for a few hours. I will try @Lars_Simon_Zehnder’s suggestion and let you know the result.


Can you share the YAML file you’re using?

Hi Alex, it’s basically the default config:

cluster_name: default

provider:
    type: local
    head_ip: 172.22.157.115
    worker_ips: [172.22.157.114, 172.22.157.113, 172.22.157.116]

auth:
    ssh_user: root

min_workers: 1
max_workers: 3
upscaling_speed: 1.0
idle_timeout_minutes: 999999

file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: []

head_setup_commands:
    - source activate
    - conda activate option

worker_setup_commands:
    - source activate
    - conda activate option

head_start_ray_commands:
    - conda activate option && ray start --head --port=6379 --dashboard-host='0.0.0.0' --dashboard-port=8265 --include-dashboard=True --num-cpus=32

worker_start_ray_commands:
    - source activate && conda activate option && ray start --address=$RAY_HEAD_IP:6379 --num-cpus=32

I believe the issue is that you need --autoscaling-config=~/ray_bootstrap_config.yaml in your head_start_ray_commands.

head_start_ray_commands:
    - conda activate option && ray start --head --port=6379 --dashboard-host='0.0.0.0' --dashboard-port=8265 --include-dashboard=True --num-cpus=32 --autoscaling-config=~/ray_bootstrap_config.yaml

Also, to confirm: above you mentioned using ray start to start your nodes, but with this type of config your top-level command should be ray up config.yaml (which calls ray start under the hood).
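
For completeness, a minimal sketch of that workflow, assuming the config above is saved locally as config.yaml (ray status is just a convenient way to check the autoscaler’s view of the cluster afterwards):

# Launch (or update) the cluster; ray up runs the setup and start commands on each node
ray up config.yaml

# On the head node: print the autoscaler's current view of nodes and resources
ray status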


Thanks @Alex for the suggestion and the additional flag for the autoscaler YAML file. @JustinY, let us know if @Alex’s suggestion resolves your problem.

Hi @Jules_Damji, thanks to @Alex the problem is solved and the autoscaler works well now. Thanks!