- Medium: It causes significant difficulty in completing my task, but I can work around it.

Ray 2.4.0, Python 3.10, cluster on GCP
I create a GCP cluster with the following parameters: 1 head node with 1 CPU, worker nodes with 2 CPUs each, `min_workers: 0`, `max_workers: 5`, `upscaling_speed: 1.0`, `idle_timeout_minutes: 1`.
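For reference, a minimal sketch of the cluster launcher YAML these parameters correspond to (node type names, machine types, and region are my assumptions, not the exact config I use):

```yaml
cluster_name: serve-cluster        # hypothetical name
max_workers: 5
upscaling_speed: 1.0
idle_timeout_minutes: 1

provider:
  type: gcp
  region: us-central1              # assumption
  project_id: my-project           # assumption

available_node_types:
  head_node:                       # 1 CPU head
    resources: {"CPU": 1}
    node_config: {}                # machine details omitted
  worker_node:                     # 2 CPU workers
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 2}
    node_config: {}                # machine details omitted

head_node_type: head_node
```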
I deploy a Ray Serve application on this cluster with the following config:
```yaml
deployments:
  - name: DLModelProcessor
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 0
      max_replicas: 5
      target_num_ongoing_requests_per_replica: 1.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      downscale_delay_s: 120.0
      upscale_delay_s: 30.0
    ray_actor_options:
      num_cpus: 2.0
  - name: Backend
    autoscaling_config:
      min_replicas: 1
      initial_replicas: 1
      max_replicas: 1
      target_num_ongoing_requests_per_replica: 100.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      downscale_delay_s: 600.0
      upscale_delay_s: 30.0
    ray_actor_options:
      num_cpus: 1.0
```
When I send a request, I expect one DLModelProcessor replica to be created, so one new worker node should be launched and the replica should start running there. Instead, after the first new node launches (which takes about 3 minutes), the launch of a second node begins. For a while I have 2 worker nodes, and after 1 minute the second one is terminated because it is idle.
I have seen an issue about a race condition in the node launcher, but the advice given there (setting `provider: foreground_node_launch: True`) did not help me.
When I set both `provider: foreground_node_launch: True` and the environment variable `AUTOSCALER_UPDATE_INTERVAL_S=10`, the problem seems to be solved (in 5 out of 5 runs, only 1 node was launched). Either option alone does not help.
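For clarity, this is roughly how I apply the combined workaround in the cluster YAML (the exact `ray start` command line is a sketch; my real command includes more flags):

```yaml
provider:
  type: gcp
  foreground_node_launch: true   # launch nodes synchronously in the monitor loop

head_start_ray_commands:
  # Shorten the autoscaler update interval before starting the head.
  # The extra ray start flags here are placeholders, not my full command.
  - export AUTOSCALER_UPDATE_INTERVAL_S=10 && ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml
```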
Here is an extract from monitor.log from a run where sending a request to Serve triggered the extra node provisioning.
Why do I see this behavior, and what is the correct way to solve this problem?