I’m very new to model serving (and web things in general), so please excuse any inaccuracies.
I want to serve a model that requires a GPU to run using ray serve and want to use autoscaling to control the number of running replicas and nodes, based on the number of requests the model is receiving. Also when there are no requests I would like to shut down all nodes, as GPUs are quite expensive.
I found references to an experimental autoscaler for serve in the docs, and found the pull request adding it in the source files.
I have two specific questions:
It seems that it was possible to simply pass an autoscaling config via the backend config, but the references to this in the serve controller seem to have disappeared in the current master branch? So was this feature removed entirely or is there a different alternative now?
Does adjusting the number of replicas (automatically via the autoscaler) also automatically adjust the number of actors/nodes? I guess this would then be handled separately via the ray cluster autoscaler?