Ray Serve deployment is not scaling up; ongoing requests are always 0

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Ray team, I have deployed a Ray cluster with KubeRay and successfully served a simple synchronous YOLOv7 PyTorch model over gRPC. It processes requests without any issue, but I can't get the scale-up part of autoscaling to work.

The symptoms are:

  • I can see ray_serve_num_ongoing_grpc_requests piling up, meaning the requests are all queued at the proxy
  • ray_serve_replica_processing_queries is always 0
  • ray_serve_replica_pending_queries is always 0

My deployment config is as follows:

max_concurrent_queries: 10
user_config: null
autoscaling_config:
  min_replicas: 0
  initial_replicas: 1
  max_replicas: 8
  target_num_ongoing_requests_per_replica: 1
  metrics_interval_s: 2
  look_back_period_s: 4
  smoothing_factor: 0.8
  upscale_smoothing_factor: 0.8
  downscale_smoothing_factor: 0.3
  downscale_delay_s: 600
  upscale_delay_s: 10
graceful_shutdown_wait_loop_s: 2
graceful_shutdown_timeout_s: 20
health_check_period_s: 10
health_check_timeout_s: 30

The replicas even scale down to 0 while many requests are still ongoing, presumably because ray_serve_replica_processing_queries is 0.
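To illustrate why that scale-down looks consistent with the zero metric: as I understand the docs, the autoscaler sizes the deployment by the ratio of observed ongoing requests per replica to target_num_ongoing_requests_per_replica. A rough sketch of that decision (smoothing factors and delay windows deliberately left out, so this is a simplification, not Ray's actual implementation):

```python
import math

# Rough sketch of the documented autoscaling decision: scale the replica
# count by the ratio of observed ongoing requests per replica to the target.
# Smoothing factors and upscale/downscale delays are omitted for clarity.
def desired_replicas(current: int, ongoing_per_replica: float, target: float) -> int:
    return math.ceil(current * ongoing_per_replica / target)

# If each replica reports 0 ongoing requests (as ray_serve_replica_processing_queries
# suggests here), the desired count collapses to 0 no matter how deep the proxy queue is:
print(desired_replicas(1, 0.0, 1.0))  # -> 0
print(desired_replicas(2, 3.0, 1.0))  # -> 6 (what I'd expect under load)
```

So if the replica-side metric never moves off 0, scaling to zero is exactly what this formula would do, even with requests queued at the proxy.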

I am not sure if it's because I'm serving the endpoint over gRPC. Any help is welcome, as I've been scratching my head for quite some time.
Thanks!