Ray Serve deployment is not scaling up, ongoing requests are always 0

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Ray team, I have deployed a Ray cluster with KubeRay and successfully served a simple synchronous yolov7 PyTorch model over gRPC. It processes requests without any issue, but I can’t get the scale-up part of autoscaling to work.
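
A simplified sketch of how the deployment is structured (the proto, class, and helper names below are placeholders for my actual code, not Ray APIs); the Serve gRPC proxy calls the deployment method whose name matches the RPC defined in my servicer:

from ray import serve
import torch

# placeholder protobuf messages generated from my .proto file
from object_detection_pb2 import DetectionRequest, DetectionResponse


@serve.deployment
class YoloV7Detector:
    def __init__(self):
        # placeholder path; my real code loads the yolov7 PyTorch weights here
        self.model = torch.jit.load("yolov7_traced.pt")
        self.model.eval()

    def Detect(self, request: DetectionRequest) -> DetectionResponse:
        # synchronous inference, one image per request
        # (pre/post-processing omitted, not my exact pipeline)
        tensor = decode_image(request.image_bytes)  # placeholder helper
        with torch.no_grad():
            detections = self.model(tensor)
        return DetectionResponse(detections=detections.tolist())


# bound application; the Serve config also registers the generated
# add_...Servicer_to_server function under grpc_options.grpc_servicer_functions
app = YoloV7Detector.bind()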

The symptoms are (a sketch of how I query these metrics follows the list):

  • I can see the ray_serve_num_ongoing_grpc_requests piling up, meaning the requests are all queued at the proxy
  • ray_serve_replica_processing_queries is always 0
  • ray_serve_replica_pending_queries is always 0
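
For reference, this is roughly how I check those metrics through the Prometheus HTTP API (the Prometheus URL below is a placeholder for the instance that scrapes the Ray cluster):

import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder address

for metric in (
    "ray_serve_num_ongoing_grpc_requests",
    "ray_serve_replica_processing_queries",
    "ray_serve_replica_pending_queries",
):
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": metric}, timeout=10
    )
    # each result is one labelled time series with its current value
    for series in resp.json()["data"]["result"]:
        print(metric, series["metric"].get("deployment"), series["value"][1])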

My deployment config is as follows (an equivalent decorator form is sketched after it):

max_concurrent_queries: 10
user_config: null
autoscaling_config:
  min_replicas: 0
  initial_replicas: 1
  max_replicas: 8
  target_num_ongoing_requests_per_replica: 1
  metrics_interval_s: 2
  look_back_period_s: 4
  smoothing_factor: 0.8
  upscale_smoothing_factor: 0.8
  downscale_smoothing_factor: 0.3
  downscale_delay_s: 600
  upscale_delay_s: 10
graceful_shutdown_wait_loop_s: 2
graceful_shutdown_timeout_s: 20
health_check_period_s: 10
health_check_timeout_s: 30
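
For reference, the same settings expressed through the @serve.deployment decorator (just a sketch of the equivalent decorator form; my actual deployment is configured via the YAML above):

from ray import serve


@serve.deployment(
    max_concurrent_queries=10,
    autoscaling_config={
        "min_replicas": 0,
        "initial_replicas": 1,
        "max_replicas": 8,
        "target_num_ongoing_requests_per_replica": 1,
        "metrics_interval_s": 2,
        "look_back_period_s": 4,
        "smoothing_factor": 0.8,
        "upscale_smoothing_factor": 0.8,
        "downscale_smoothing_factor": 0.3,
        "downscale_delay_s": 600,
        "upscale_delay_s": 10,
    },
    graceful_shutdown_wait_loop_s=2,
    graceful_shutdown_timeout_s=20,
    health_check_period_s=10,
    health_check_timeout_s=30,
)
class YoloV7Detector:
    ...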

The replicas even scale down to 0 while many requests are still in flight, because ray_serve_replica_processing_queries stays at 0.

I am not sure if it’s because I’m serving the endpoint with gRPC. Any help is welcome, as I’ve been scratching my head for quite some time.
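
For context, this is roughly how the client sends requests (the proto, stub, address, and application names are placeholders, not my exact client):

import grpc

# placeholder modules generated from my .proto file
from object_detection_pb2 import DetectionRequest
from object_detection_pb2_grpc import ObjectDetectionServiceStub

channel = grpc.insecure_channel("ray-head.example.com:9000")  # placeholder address
stub = ObjectDetectionServiceStub(channel)

request = DetectionRequest(image_bytes=open("sample.jpg", "rb").read())
# the Serve gRPC proxy uses the "application" metadata key to route to the app
response = stub.Detect(request, metadata=(("application", "yolov7_app"),))
print(response)
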
Thanks!

Hi, can you please tell me if you managed to solve the problem? Do you have the YOLO model converted to ONNX too?