Autoscaling with `max_concurrent_queries = 1`

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Is it possible to use Ray Serve autoscaling with the deployment configured to have max_concurrent_queries = 1?

My reading of the source, specifically the error_ratio computation, is that this isn’t possible, but it would be helpful if there were a way to do it (likely related to [Feature] Autoscaling Based on Full Request Queue · Issue #20977 · ray-project/ray · GitHub).

Nice find. I do think this is a bug or a missing configuration option. The autoscaling_policy reads:

    # Example: if error_ratio == 2.0, we have two times too many ongoing
    # requests per replica, so we desire twice as many replicas.
    error_ratio: float = (
        num_ongoing_requests_per_replica
        / autoscaling_config.target_num_ongoing_requests_per_replica
    )

    # Multiply the distance to 1 by the smoothing ("gain") factor (default=1).
    smoothed_error_ratio = 1 + ((error_ratio - 1) * autoscaling_config.smoothing_factor)
    desired_num_replicas = math.ceil(current_num_replicas * smoothed_error_ratio)

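With max_concurrent_queries = 1, a replica never reports more than one ongoing request, so num_ongoing_requests_per_replica tops out at 1. With the target also at 1, error_ratio = 1 and smoothed_error_ratio = 1, so desired_num_replicas stays at current_num_replicas (1 here) and the deployment never scales up.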
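To make the arithmetic concrete, here is a minimal standalone sketch of the quoted computation using plain numbers instead of Ray's config object. The function name `desired_replicas` and its keyword defaults are assumptions made for illustration; this is not Ray's API, only the same three lines of math.

```python
import math


def desired_replicas(
    current_num_replicas: int,
    num_ongoing_requests_per_replica: float,
    target_num_ongoing_requests_per_replica: float = 1.0,
    smoothing_factor: float = 1.0,
) -> int:
    """Hypothetical standalone copy of the quoted computation (not Ray's API)."""
    error_ratio = (
        num_ongoing_requests_per_replica / target_num_ongoing_requests_per_replica
    )
    # Multiply the distance to 1 by the smoothing ("gain") factor.
    smoothed_error_ratio = 1 + ((error_ratio - 1) * smoothing_factor)
    return math.ceil(current_num_replicas * smoothed_error_ratio)


# One replica, saturated at one ongoing request, target of one: no scale-up.
saturated = desired_replicas(
    current_num_replicas=1, num_ongoing_requests_per_replica=1.0
)

# For contrast only: if a replica could report two ongoing requests,
# the ratio would double and so would the desired replica count.
overloaded = desired_replicas(
    current_num_replicas=1, num_ongoing_requests_per_replica=2.0
)

print(saturated, overloaded)
```

The second call is there only for contrast: the formula scales up as soon as the measured load exceeds the target, which is exactly what a cap of one ongoing request per replica prevents.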

Can you please open a GitHub issue so we can track this as a blocker for graduating the autoscaling feature? Thanks!

Done! [Serve] Can't autoscale deployment when target ongoing requests is 1 · Issue #24793 · ray-project/ray · GitHub