Hi,
I’m very new to model serving (and web things in general), so please excuse any inaccuracies.
I want to serve a model that requires a GPU, using Ray Serve, and I want to use autoscaling to control the number of running replicas and nodes based on the number of requests the model receives. When there are no requests, I would also like to shut down all nodes, since GPUs are quite expensive.
I found references to an experimental autoscaler for Serve in the docs, and found the pull request that added it in the source files.
I have two specific questions:
It seems that it was previously possible to simply pass an autoscaling config via the backend config, but the references to this in the serve controller seem to have disappeared from the current master branch. Was this feature removed entirely, or is there a different alternative now?
Does adjusting the number of replicas (automatically, via the autoscaler) also automatically adjust the number of actors/nodes? I assume this would then be handled separately by the ray cluster autoscaler?
Hi @JohannesAck, thanks for the question. We are revamping our controller architecture right now, so the autoscaling configuration has been removed. Additionally, we are still designing a widely applicable autoscaling algorithm. If you can share your use case with us, it will help our design!
As for the scaling nodes, here’s how it will work:
ray serve scales the number of actors
ray core/the ray autoscaler notices that resource demand exceeds the available resources and scales up more nodes
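To sketch how those two layers interact in plain Python (the function names and parameters here are purely illustrative, not Ray APIs): a request-based layer picks a replica count, and the node count then follows from the GPU demand those replicas create.

```python
import math

def desired_replicas(ongoing_requests: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count a request-based autoscaler would aim for."""
    if ongoing_requests == 0:
        return min_replicas  # min_replicas == 0 allows scale-to-zero
    wanted = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

def nodes_needed(replicas: int, gpus_per_replica: int, gpus_per_node: int) -> int:
    """Nodes the cluster autoscaler must provide to satisfy the GPU demand."""
    return math.ceil(replicas * gpus_per_replica / gpus_per_node)

# 10 in-flight requests with a target of 2 per replica -> 5 replicas;
# 1 GPU each on 4-GPU nodes -> 2 nodes.
print(desired_replicas(10, 2, 0, 8))  # -> 5
print(nodes_needed(5, 1, 4))          # -> 2
```

The key point is that serve only ever reasons about replicas; nodes appear (or disappear) as a side effect of the resource requests those replicas place on the cluster.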
I’m not sure my use case is very representative: I’m not just serving a single output of a tf/torch model, but rather running a search over the inputs of a model. The goal is text-to-image generation related to big sleep.
The main challenge is that this takes quite a lot of compute per request, so for each request I ideally want to start a new GPU instance, since generating an image completely occupies a single instance for up to a minute. I would then like to shut these instances down again quickly to save cost.
Essentially it’s a very expensive computation that needs a single GPU to itself, takes maybe a minute to run, and whose request rate might vary wildly.
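Concretely, the scale-to-zero behaviour I have in mind could be written as a toy idle-timeout policy like this (the class and its methods are just my own sketch, not anything Ray provides):

```python
import time
from typing import Optional

class IdleShutdownPolicy:
    """Toy scale-to-zero policy: shut a GPU replica down once it has
    been idle for idle_timeout_s seconds. (My own sketch, not a Ray API.)"""

    def __init__(self, idle_timeout_s: float) -> None:
        self.idle_timeout_s = idle_timeout_s
        self.last_request_ts = time.monotonic()

    def on_request(self) -> None:
        # Called whenever a request arrives; resets the idle clock.
        self.last_request_ts = time.monotonic()

    def should_shut_down(self, now: Optional[float] = None) -> bool:
        # True once the replica has been idle for at least the timeout.
        now = time.monotonic() if now is None else now
        return now - self.last_request_ts >= self.idle_timeout_s
```

With a short timeout this would keep a GPU instance alive just long enough to absorb bursts, then release it.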
Currently I’m also still struggling to keep the start-up time sufficiently low. I guess the solution there might be to use a machine image instead of docker, but that doesn’t seem to be specifically supported by ray either. It should be relatively straightforward to do on my own, though.
Anyways, I’m looking forward to serve autoscaling coming back eventually. Thanks for the work in general, ray is great!