Serve Controller high cpu usage when using autoscaling

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

When I specify autoscaling config for my ray serve deployment, under the ‘CPU’ column of serve controller on the ‘Cluster’ dashboard page, the CPU usage keeps increasing as time goes by. After awhile, the CPU usage gets so high (above 100%) that my pipeline just hangs and stops running. At the same time, the ‘node network’ graph slowly decreases to 0.

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 2,
        "upscale_delay_s": 0.1,
        "downscale_delay_s": 60,
        "smoothing_factor": 100
    },
    ray_actor_options={"num_cpus": 1, "num_gpus": 0.3},
)

I confirmed this by removing the autoscaling config and running a fixed number of replicas, and this problem goes away.

@serve.deployment(
    num_replicas=1,
    ray_actor_options={"num_cpus": 1, "num_gpus": 0.3}
)