Dynamically serve new model via Ray Serve

How severe does this issue affect your experience of using Ray?

  • Low: Looking for general direction

Hi, I would like to ask for general direction on how to automate the serving of models that have been trained on the Ray cluster.

For context: we have an API service connected to the Ray cluster that allows users to submit training jobs with custom scripts, environments, and datasets. After a training job runs on the Ray cluster, the model checkpoint and related information are stored in MinIO.

According to the Ray Serve documentation, to serve a model we need to manually write a script with @serve.deployment for each model. Is there a way to automate the serving process without this manual step, assuming we already have Ray Serve deployed on a Kubernetes cluster?
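To be concrete, the per-model script we write today looks roughly like the sketch below. `ModelServer`, `load_model`, and the checkpoint path are just placeholders, not actual Ray Serve APIs; only the `@serve.deployment` pattern itself comes from the docs.

```python
from ray import serve


def load_model(checkpoint_uri: str):
    # Placeholder: download the checkpoint (e.g. from MinIO) and
    # deserialize it with your framework of choice.
    ...


@serve.deployment
class ModelServer:
    def __init__(self, checkpoint_uri: str):
        self.model = load_model(checkpoint_uri)

    async def __call__(self, request):
        payload = await request.json()
        return {"prediction": self.model.predict(payload)}


# Today: one such script (or one entry in the Serve config) per model.
app = ModelServer.bind("s3://models/my-model/checkpoint")
```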

I'm looking for something similar; any suggestions from the Ray team?

I have my models mounted on disk and want to load them on a per-request basis, run inference, and bring the replica down when there is no activity.
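Roughly what I'm after is something like the sketch below: Serve's autoscaling config lets a deployment scale down to zero replicas when it is idle. The deployment body and the exact replica counts and delays are placeholders.

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,         # scale to zero when there is no traffic
        "max_replicas": 4,
        "downscale_delay_s": 300,  # how long to wait before tearing a replica down
    }
)
class OnDemandModel:
    def __init__(self):
        # Placeholder: load the model from the mounted disk here.
        self.model = None

    async def __call__(self, request):
        return "ok"


app = OnDemandModel.bind()
```

The trade-off is a cold start on the first request after scale-down, since `__init__` (and the model load) runs again when a replica comes back up.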

Model multiplexing helped a bit, but it occupies additional memory for the cached models, which limits utilization. Can this be done better with decomposition, i.e. spawning new replicas based on requests and bringing them down when they are inactive?
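For reference, this is the knob I have been using to limit the cache: the multiplexed loader takes a `max_num_models_per_replica` cap, and least-recently-used models are evicted once the cap is hit. The loader body below is a placeholder.

```python
from ray import serve


@serve.deployment
class ModelMultiplexer:
    @serve.multiplexed(max_num_models_per_replica=2)  # cap cached models per replica
    async def get_model(self, model_id: str):
        # Placeholder: load `model_id` from the mounted model store on demand.
        # Least-recently-used models are evicted once the cap is reached.
        return f"loaded-{model_id}"

    async def __call__(self, request):
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return {"model": model_id}


app = ModelMultiplexer.bind()
```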

One approach is to change the implementation in your deployment to always fetch the latest checkpoint from wherever the model is stored. Then, after the model training step completes, invoke the serve deploy step again (through the API); this will result in a redeployment of the Ray Serve application.
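A rough sketch of that idea, assuming the checkpoints live in a MinIO/S3 bucket; the boto3 client, endpoint, bucket, and key layout are made up for illustration:

```python
import boto3
from ray import serve


@serve.deployment
class LatestCheckpointModel:
    def __init__(self, bucket: str = "models", prefix: str = "my-model/"):
        # Hypothetical MinIO layout: treat the most recently modified object
        # under the prefix as "the latest checkpoint".
        s3 = boto3.client("s3", endpoint_url="http://minio:9000")
        objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
        latest = max(objects, key=lambda o: o["LastModified"])
        s3.download_file(bucket, latest["Key"], "/tmp/checkpoint")
        self.model = self._load("/tmp/checkpoint")

    def _load(self, path: str):
        # Placeholder: deserialize the checkpoint with your framework.
        ...

    async def __call__(self, request):
        return "ok"


app = LatestCheckpointModel.bind()
```

Each redeploy re-runs `__init__`, so the replicas pick up whatever checkpoint is newest at that point.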

Can you help me with the documentation link for the API for production Kubernetes deployment?

Can you take a look at Deploy Ray Serve Applications — Ray 2.46.0, and use the Kubernetes Python client (GitHub - kubernetes-client/python: Official Python client library for kubernetes) to invoke the Kubernetes commands from your training script?
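For example, something like the sketch below, assuming a KubeRay RayService named `my-rayservice` in namespace `ray`; the CR name, namespace, and the new `serveConfigV2` YAML are placeholders, and the CR group/version may differ depending on your KubeRay release.

```python
from kubernetes import client, config


def redeploy_serve_app(new_serve_config_yaml: str):
    # Load in-cluster credentials (use config.load_kube_config() outside the cluster).
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    # Patch the RayService CR; KubeRay then rolls out the updated Serve config.
    api.patch_namespaced_custom_object(
        group="ray.io",
        version="v1",
        namespace="ray",
        plural="rayservices",
        name="my-rayservice",
        body={"spec": {"serveConfigV2": new_serve_config_yaml}},
    )
```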


Is there a better solution than redeploying? My system is going to be serving a dynamically changing set of models, so there would be a lot of restarting with the above approach.

I was reading through the multiplexing documentation and found this example.
If I am able to scale the Downstream deployment up to 10 replicas and give it access to the model store,
that would give me the flexibility to serve models dynamically, greatly reducing spin-up time.

```python
import requests
import starlette.requests

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Downstream:
    def __call__(self):
        # Returns the model ID that the caller requested via multiplexing.
        return serve.get_multiplexed_model_id()


@serve.deployment
class Upstream:
    def __init__(self, downstream: DeploymentHandle):
        self._h = downstream

    async def __call__(self, request: starlette.requests.Request):
        # Route the call to a Downstream replica serving model "bar".
        return await self._h.options(multiplexed_model_id="bar").remote()


serve.run(Upstream.bind(Downstream.bind()))
resp = requests.get("http://localhost:8000")
```
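Continuing that snippet (illustrative only, reusing `Downstream`, `DeploymentHandle`, `starlette`, and `requests` from above), I am imagining letting the caller pick the model and scaling Downstream up, something like:

```python
@serve.deployment
class Upstream:
    def __init__(self, downstream: DeploymentHandle):
        self._h = downstream

    async def __call__(self, request: starlette.requests.Request):
        # Let the caller choose the model via a query parameter instead of
        # hardcoding "bar"; "default" is a placeholder fallback.
        model_id = request.query_params.get("model", "default")
        return await self._h.options(multiplexed_model_id=model_id).remote()


# Scale Downstream to 10 replicas (the number is illustrative).
serve.run(Upstream.bind(Downstream.options(num_replicas=10).bind()))
resp = requests.get("http://localhost:8000?model=bar")
```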