I am running Ray Serve on an autoscaling Ray cluster deployed on Kubernetes, and I have created a deployment with an autoscaling configuration.
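For context, the deployment is configured roughly like the sketch below (the class name and the numbers are placeholders, not my exact values):

```python
from ray import serve

# Rough sketch of my setup: Serve scales the number of replicas
# between min_replicas and max_replicas based on request load.
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        "target_num_ongoing_requests_per_replica": 5,
    },
)
class MyModel:
    async def __call__(self, request) -> str:
        return "ok"

app = MyModel.bind()
# serve.run(app) starts the deployment; replicas are then added
# or removed automatically as traffic changes.
```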
The scale-up and scale-down of the replicas works quite well. When all the replicas fit on one Kubernetes pod (worker node), there is no issue. When they do not fit, cluster autoscaling is triggered and the extra replicas get scheduled on a new worker pod.
When scaling down, however, replica deletion seems to follow a FIFO policy: the initial replicas are deleted first, while the replicas running on the new pod created by autoscaling remain. Since other workloads keep the old pod busy and the surviving replicas keep the new pod busy, neither pod becomes idle and the Ray cluster itself never scales back down.
Is there a specific strategy followed when choosing which replicas to delete during scale-down? Would it be possible to make that strategy configurable?