Serve Controller high CPU usage when using autoscaling

yhsmiley · June 16, 2023, 7:43am

How severely does this issue affect your experience of using Ray?

Medium: It makes completing my task significantly more difficult, but I can work around it.

When I specify an autoscaling config for my Ray Serve deployment, the value in the ‘CPU’ column for the Serve controller on the dashboard's ‘Cluster’ page keeps increasing over time. After a while, the CPU usage gets so high (above 100%) that my pipeline hangs and stops running. At the same time, the ‘node network’ graph slowly drops to 0.

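from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 2,
        "upscale_delay_s": 0.1,
        "downscale_delay_s": 60,
        "smoothing_factor": 100,
    },
    ray_actor_options={"num_cpus": 1, "num_gpus": 0.3},
)
class MyDeployment:  # placeholder name; deployment body not shown in this post
    ...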
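For readers unfamiliar with these settings, here is a rough sketch of what `smoothing_factor` does to a single scaling decision. It assumes the behavior described in the Ray Serve docs, where `smoothing_factor` is a multiplicative factor applied to the scaling error; the helper name `scaling_step`, the target of 1.0 ongoing requests per replica, and the exact formula are assumptions for illustration, and the real policy may differ between Ray versions. It does not by itself explain the controller CPU usage.

```python
import math


def scaling_step(current_replicas, ongoing_per_replica, target_per_replica,
                 smoothing_factor, min_replicas, max_replicas):
    """Hypothetical approximation of one autoscaling decision.

    Returns (raw, clamped): the replica count before and after applying
    the min_replicas / max_replicas bounds.
    """
    # How far the observed load is from the target (1.0 means on target).
    error_ratio = ongoing_per_replica / target_per_replica
    # Assumed: smoothing_factor amplifies the deviation from 1.0.
    smoothed = 1 + (error_ratio - 1) * smoothing_factor
    raw = math.ceil(current_replicas * smoothed)
    clamped = max(min_replicas, min(max_replicas, raw))
    return raw, clamped


# One replica, 1.5 ongoing requests per replica, assumed target of 1.0.
print(scaling_step(1, 1.5, 1.0, smoothing_factor=1, min_replicas=1, max_replicas=2))
print(scaling_step(1, 1.5, 1.0, smoothing_factor=100, min_replicas=1, max_replicas=2))
```

Under these assumptions, a factor of 100 makes the raw target swing far outside the 1 to 2 replica range on even a small deviation, so the bounds do all of the limiting. Combined with a 0.1 s upscale delay, the deployment would react to nearly every fluctuation in load.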

I confirmed that autoscaling is the trigger by removing the autoscaling config and running a fixed number of replicas instead. With that change, the problem goes away.

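from ray import serve

@serve.deployment(
    num_replicas=1,
    ray_actor_options={"num_cpus": 1, "num_gpus": 0.3},
)
class MyDeployment:  # placeholder name; deployment body not shown in this post
    ...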
