Dynamically serve new model via Ray Serve

How severe does this issue affect your experience of using Ray?

  • Low: Looking for general direction

Hi, I would like to ask for general direction on how to automate the serving of models that have been trained on the Ray cluster.

For context: we have an API service connected to the Ray cluster that allows users to submit training jobs with custom scripts, environments, and datasets. After a training job runs on the Ray cluster, the model checkpoint and related information are stored in MinIO.

According to the Ray Serve documentation, to serve a model we need to manually write a script with @serve.deployment for each model. Is there a way to automate the serving process without this manual step, assuming we already have Ray Serve deployed on a Kubernetes cluster?
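To be concrete, the per-model script we write today looks roughly like the sketch below. `ModelServer`, `load_model`, and the checkpoint path are just placeholders, not actual Ray Serve APIs; only the `@serve.deployment` pattern itself comes from the docs.

```python
from ray import serve


def load_model(checkpoint_uri: str):
    # Placeholder: download the checkpoint (e.g. from MinIO) and
    # deserialize it with your framework of choice.
    ...


@serve.deployment
class ModelServer:
    def __init__(self, checkpoint_uri: str):
        self.model = load_model(checkpoint_uri)

    async def __call__(self, request):
        payload = await request.json()
        return {"prediction": self.model.predict(payload)}


# Today: one such script (or one entry in the Serve config) per model.
app = ModelServer.bind("s3://models/my-model/checkpoint")
```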

I'm looking for something similar; any suggestions from the Ray team?

I have my models mounted on disk and want to load them on a per-request basis, run inference, and bring the replica down when there is no activity.
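Roughly what I'm after is something like the sketch below: Serve's autoscaling config lets a deployment scale down to zero replicas when it is idle. The deployment body and the exact replica counts and delays are placeholders.

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,         # scale to zero when there is no traffic
        "max_replicas": 4,
        "downscale_delay_s": 300,  # how long to wait before tearing a replica down
    }
)
class OnDemandModel:
    def __init__(self):
        # Placeholder: load the model from the mounted disk here.
        self.model = None

    async def __call__(self, request):
        return "ok"


app = OnDemandModel.bind()
```

The trade-off is a cold start on the first request after scale-down, since `__init__` (and the model load) runs again when a replica comes back up.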

Model multiplexing helped a bit, but it occupies additional memory for the cached models, which limits utilization. Can this be done better with decomposition, i.e. spawning new replicas based on requests and bringing them down when they are inactive?
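For reference, this is the knob I have been using to limit the cache: the multiplexed loader takes a `max_num_models_per_replica` cap, and least-recently-used models are evicted once the cap is hit. The loader body below is a placeholder.

```python
from ray import serve


@serve.deployment
class ModelMultiplexer:
    @serve.multiplexed(max_num_models_per_replica=2)  # cap cached models per replica
    async def get_model(self, model_id: str):
        # Placeholder: load `model_id` from the mounted model store on demand.
        # Least-recently-used models are evicted once the cap is reached.
        return f"loaded-{model_id}"

    async def __call__(self, request):
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return {"model": model_id}


app = ModelMultiplexer.bind()
```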

One approach is to change the implementation in your deployment to always fetch the latest checkpoint from wherever the model is stored. Then, after the model training step completes, invoke the serve deploy step again (through the API); this will result in a redeployment of the Ray Serve application.
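A rough sketch of that idea, assuming the checkpoints live in a MinIO/S3 bucket; the boto3 client, endpoint, bucket, and key layout are made up for illustration:

```python
import boto3
from ray import serve


@serve.deployment
class LatestCheckpointModel:
    def __init__(self, bucket: str = "models", prefix: str = "my-model/"):
        # Hypothetical MinIO layout: treat the most recently modified object
        # under the prefix as "the latest checkpoint".
        s3 = boto3.client("s3", endpoint_url="http://minio:9000")
        objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
        latest = max(objects, key=lambda o: o["LastModified"])
        s3.download_file(bucket, latest["Key"], "/tmp/checkpoint")
        self.model = self._load("/tmp/checkpoint")

    def _load(self, path: str):
        # Placeholder: deserialize the checkpoint with your framework.
        ...

    async def __call__(self, request):
        return "ok"


app = LatestCheckpointModel.bind()
```

Each redeploy re-runs `__init__`, so the replicas pick up whatever checkpoint is newest at that point.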

Can you help me with the documentation link for the API for production Kubernetes deployment?

Can you take a look at Deploy Ray Serve Applications — Ray 2.46.0, and use the Kubernetes Python client (GitHub - kubernetes-client/python: Official Python client library for kubernetes) to invoke the Kubernetes commands from your training script?
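For example, something like the sketch below, assuming a KubeRay RayService named `my-rayservice` in namespace `ray`; the CR name, namespace, and the new `serveConfigV2` YAML are placeholders, and the CR group/version may differ depending on your KubeRay release.

```python
from kubernetes import client, config


def redeploy_serve_app(new_serve_config_yaml: str):
    # Load in-cluster credentials (use config.load_kube_config() outside the cluster).
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    # Patch the RayService CR; KubeRay then rolls out the updated Serve config.
    api.patch_namespaced_custom_object(
        group="ray.io",
        version="v1",
        namespace="ray",
        plural="rayservices",
        name="my-rayservice",
        body={"spec": {"serveConfigV2": new_serve_config_yaml}},
    )
```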


Is there a better solution than redeploying? My system is going to be serving a dynamically changing set of models, so there would be a lot of restarting with the above approach.

I was reading through the multiplexing documentation and found this example.
If I am able to scale the Downstream deployment up to 10 replicas and give it access to the model store,
that would give me the flexibility to serve models dynamically, greatly reducing spin-up time.

```python
import requests
import starlette.requests

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Downstream:
    def __call__(self):
        # Returns the model ID that the caller requested via multiplexing.
        return serve.get_multiplexed_model_id()


@serve.deployment
class Upstream:
    def __init__(self, downstream: DeploymentHandle):
        self._h = downstream

    async def __call__(self, request: starlette.requests.Request):
        # Route the call to a Downstream replica serving model "bar".
        return await self._h.options(multiplexed_model_id="bar").remote()


serve.run(Upstream.bind(Downstream.bind()))
resp = requests.get("http://localhost:8000")
```
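Continuing that snippet (illustrative only, reusing `Downstream`, `DeploymentHandle`, `starlette`, and `requests` from above), I am imagining letting the caller pick the model and scaling Downstream up, something like:

```python
@serve.deployment
class Upstream:
    def __init__(self, downstream: DeploymentHandle):
        self._h = downstream

    async def __call__(self, request: starlette.requests.Request):
        # Let the caller choose the model via a query parameter instead of
        # hardcoding "bar"; "default" is a placeholder fallback.
        model_id = request.query_params.get("model", "default")
        return await self._h.options(multiplexed_model_id=model_id).remote()


# Scale Downstream to 10 replicas (the number is illustrative).
serve.run(Upstream.bind(Downstream.options(num_replicas=10).bind()))
resp = requests.get("http://localhost:8000?model=bar")
```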