Ray Serve: custom resource optimization

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hey, we are working with some custom hardware on which we can run neural network inference, and I am looking at Ray Serve to quickly spin up a cluster that we can easily scale up/down when needed. I understand how to assign custom resources to each machine so that I can connect X pieces of hardware, and I understand the basics of Serve.
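For reference, this is roughly what I mean by assigning custom resources (the resource name "accelerator" is just a placeholder for whatever we register on each machine):

# Each machine with a piece of hardware is started with a custom resource, e.g.:
#   ray start --address=<head-address> --resources='{"accelerator": 1}'

from ray import serve

@serve.deployment(ray_actor_options={"resources": {"accelerator": 1}})
class InferenceDeployment:
    def __init__(self):
        # Reserve and initialize one piece of the custom hardware here
        ...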

The catch is that I’d like to be able to perform inference with various models, so that in my request I specify something like (model, image). That could easily be done by defining a different deployment for each model, where in __init__ I initialize the custom hardware and load the model (similar to the PyTorch model serving example with conditional inference). But I’d like to make it more general - perhaps we’d have more models than possible deployments (hardware). This led me to thinking that I could initialize the model in __call__, but that adds overhead, since it takes some time to load the model onto the custom hardware before performing the inference. Once a model is loaded, inference on subsequent messages is faster.
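To make the trade-off concrete, here is a rough sketch of the two options (the load_model_on_hardware helper is hypothetical):

from ray import serve

# Option 1: one deployment per model, model loaded once in __init__
@serve.deployment(ray_actor_options={"resources": {"accelerator": 1}})
class ModelADeployment:
    def __init__(self):
        self.model = load_model_on_hardware("model_a")  # slow, but paid only once

    async def __call__(self, request):
        return self.model.call(request.input)

# Option 2: one generic deployment, model loaded lazily in __call__
@serve.deployment(ray_actor_options={"resources": {"accelerator": 1}})
class GenericDeployment:
    def __init__(self):
        self.model_name = None
        self.model = None

    async def __call__(self, request):
        if request.model_name != self.model_name:
            # Pays the load cost every time the requested model changes
            self.model = load_model_on_hardware(request.model_name)
            self.model_name = request.model_name
        return self.model.call(request.input)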

I am curious whether there is a better way to use Ray Serve for this. It would help if I could initialize a model on a deployment for a certain period, perform inference on all of the queued messages for that model, and afterward initialize a different model. Would something like that be possible?

Thanks,
Matija

Hi @waterf00l, welcome to the forums and thanks for posting!

Could you put each model in its own deployment, and run the deployments on the corresponding hardware using custom resources? That would let each model stay live and respond to requests quickly.
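For example, something like this (the resource names are placeholders for whatever you register on each machine):

from ray import serve

# Each model gets its own deployment, pinned to the hardware type it needs
@serve.deployment(ray_actor_options={"resources": {"accel_type_1": 1}})
class ModelA:
    def __init__(self):
        self.model = ...  # load model A onto its hardware once, at startup

    async def __call__(self, request):
        return self.model.call(request.input)

@serve.deployment(ray_actor_options={"resources": {"accel_type_2": 1}})
class ModelB:
    def __init__(self):
        self.model = ...  # load model B onto its hardware once, at startup

    async def __call__(self, request):
        return self.model.call(request.input)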

Alternatively, I wonder if Serve’s async/await syntax would be helpful here. Suppose you have 3 models A, B, C:

import asyncio

from ray import serve


@serve.deployment
class ModelManager:

    def __init__(self):
        self.message_queue = dict()  # Maps model name to number of pending requests
        self.model_queue = list()    # Model names waiting for their turn to be loaded
        self.model = None
        self.model_name = ""

    async def __call__(self, request):
        # `request` is assumed to expose `model_name` and `input`.
        self.message_queue[request.model_name] = self.message_queue.get(request.model_name, 0) + 1
        if request.model_name not in self.model_queue:
            self.model_queue.append(request.model_name)
            # Schedule the load as a background task so it runs asynchronously
            asyncio.create_task(self.load_model(request.model_name))
        while self.model_name != request.model_name:
            await asyncio.sleep(0.1)
        output = self.model.call(request.input)
        self.message_queue[request.model_name] -= 1
        return output

    async def load_model(self, model_name: str):
        # Wait until this model reaches the front of the queue
        while self.model_queue[0] != model_name:
            await asyncio.sleep(0.1)
        # TODO: write logic to load and store the model in self.model using model_name
        ...
        self.model_name = model_name  # Unblocks the __call__ coroutines waiting on this model
        # Keep the model active until all buffered requests for it have been served
        while self.message_queue.get(model_name, 0) > 0:
            await asyncio.sleep(0.1)
        self.model_queue.pop(0)
        return

This would load models one after another, and each model would process all the buffered requests for that model before the next one is loaded. There’s probably a way to write this more efficiently without all the asyncio.sleep() calls, but hopefully it shows how async/await syntax might help your use case.
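For example, one option would be to coordinate the model switch with an asyncio.Condition instead of polling (this is just a sketch of the coordination piece, not a full deployment):

import asyncio

class ModelSwitch:
    """Callers wait until their model becomes active instead of
    polling with asyncio.sleep()."""

    def __init__(self):
        self.condition = asyncio.Condition()
        self.active_model_name = None

    async def wait_for(self, model_name: str):
        # Block until the loader has activated this model
        async with self.condition:
            await self.condition.wait_for(lambda: self.active_model_name == model_name)

    async def activate(self, model_name: str):
        # Called by the loader once the model is ready; wakes all waiters
        async with self.condition:
            self.active_model_name = model_name
            self.condition.notify_all()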

Hey @shrekris , thanks for the fast response!

Could you put each model in its own deployment, and run the deployments on the corresponding hardware using custom resources? That would let each model stay live and respond to requests quickly.

This was my initial idea. The problem is that each deployment would then load its model and occupy the hardware, but I might have more models that I want to run than available hardware/resources. Imagine something like 10 PyTorch models but only 5 GPUs, except it’s not GPUs and is harder to scale.

I was wondering if there might be a solution where I could mark which model is currently active on a certain deployment, and the queuing/buffering would happen on the head/controller. The deployment would then be generic and would take the model as a parameter, and the controller could decide how to efficiently schedule which models to load onto which deployments and how to forward the messages.
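Roughly what I have in mind, with made-up names and without the actual Serve plumbing for talking to the deployments:

class ModelScheduler:
    """Sketch of the controller-side bookkeeping: decide which model runs on
    which deployment and buffer requests until a deployment is free."""

    def __init__(self, deployment_ids):
        self.active_model = {d: None for d in deployment_ids}  # deployment -> model name
        self.pending = {}  # model name -> buffered requests

    def submit(self, request):
        # If some deployment already serves this model, forward to it immediately
        for deployment, model_name in self.active_model.items():
            if model_name == request.model_name:
                return deployment
        # Otherwise buffer the request until a deployment frees up
        self.pending.setdefault(request.model_name, []).append(request)
        return None

    def on_deployment_idle(self, deployment):
        # Assign the model with the most buffered requests to the idle deployment
        if not self.pending:
            self.active_model[deployment] = None
            return None
        model_name = max(self.pending, key=lambda m: len(self.pending[m]))
        self.active_model[deployment] = model_name
        return model_name, self.pending.pop(model_name)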

Thanks for sharing the async/await syntax. I will definitely give it a look! :slight_smile:

I see, thanks for explaining! Please keep us posted on how the async/await syntax goes. Feel free to post a feature request for this pattern as well. It’ll help us track it going forward.