Serving Triton models

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello! I am quite new to Ray Serve and I am checking whether it would work for us. We want to serve a number of Triton models with Ray Serve. I have seen the examples in the documentation, but they assume that inputs and outputs are plain HTTP JSON. Is there a way to use the standard HTTP inference protocols instead? The idea would be to receive requests in that format and simply forward them to an in-process Triton server.

Thanks!

Some of these protocols should be achievable with the Ray Serve FastAPI integration. For example, you could expose FastAPI endpoints in your Serve deployment that respond to requests at v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer (the inference path).
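Here is a minimal sketch of what that could look like with Ray Serve's FastAPI integration (`@serve.deployment` + `@serve.ingress`). The class name, the `_infer_with_triton` stub, and the response body are placeholders, and actually forwarding the v2 payload to an in-process Triton server is left out; this only shows the routing side.

```python
# Minimal sketch (not an official integration): a Ray Serve deployment whose
# FastAPI ingress exposes KServe-v2-style inference paths. Forwarding to an
# in-process Triton server is represented by the _infer_with_triton stub.
from fastapi import FastAPI, Request
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class TritonIngress:
    def __init__(self):
        # Placeholder: start/attach an in-process Triton server here.
        self._models = {}

    @app.post("/v2/models/{model_name}/infer")
    async def infer(self, model_name: str, request: Request):
        payload = await request.json()  # KServe v2 inference request body
        return self._infer_with_triton(model_name, None, payload)

    @app.post("/v2/models/{model_name}/versions/{model_version}/infer")
    async def infer_version(self, model_name: str, model_version: str, request: Request):
        payload = await request.json()
        return self._infer_with_triton(model_name, model_version, payload)

    def _infer_with_triton(self, name, version, payload):
        # Placeholder: hand the v2 request to Triton and return a v2 response.
        return {"model_name": name, "model_version": version, "outputs": []}


# Deploy the ingress at the root route so the /v2/... paths line up.
serve.run(TritonIngress.bind(), route_prefix="/")
```

The path templates mirror the inference path above; the request body arrives as the KServe v2 JSON document and could be handed to Triton unchanged.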

You could also implement the health-check endpoints the same way, but the catch is that each such request is handled by a single replica, so the health check reflects that one replica rather than the full deployment. One workaround might be to have the health-check endpoint query the Ray dashboard for status and return the results from there.
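A hedged sketch of that workaround, assuming the Serve REST API exposed by the Ray dashboard at its default address: the `HealthEndpoints` name, the dashboard port, and the "every application is RUNNING" interpretation of readiness are assumptions, not a prescribed pattern.

```python
# Sketch: KServe-v2-style health routes that report application-wide status by
# querying the Serve REST API on the Ray dashboard, instead of reporting only
# on the single replica that happens to serve the request.
import requests
from fastapi import FastAPI, Response
from ray import serve

health_app = FastAPI()


@serve.deployment
@serve.ingress(health_app)
class HealthEndpoints:
    DASHBOARD = "http://127.0.0.1:8265"  # default Ray dashboard address (assumption)

    @health_app.get("/v2/health/live")
    async def live(self):
        # The replica answering this request is alive by definition.
        return Response(status_code=200)

    @health_app.get("/v2/health/ready")
    async def ready(self):
        # Ask the dashboard's Serve REST API whether every application is RUNNING.
        try:
            resp = requests.get(f"{self.DASHBOARD}/api/serve/applications/", timeout=2)
            apps = resp.json().get("applications", {})
            ok = bool(apps) and all(a.get("status") == "RUNNING" for a in apps.values())
            return Response(status_code=200 if ok else 503)
        except requests.RequestException:
            return Response(status_code=503)


serve.run(HealthEndpoints.bind(), route_prefix="/", name="health")
```

In practice these routes would more likely live on the same FastAPI ingress app as the inference routes rather than in a separate deployment.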

Hello,

Thanks for your quick answer. Apart from the endpoint itself, do you know whether it is possible to use an already existing KServe inference protocol client with Ray Serve, such as the ones offered by NVIDIA?
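For context, a minimal sketch of the kind of client meant here: NVIDIA's `tritonclient` HTTP client pointed at the Serve HTTP endpoint. The port, model name, and tensor names/shapes below are placeholders, and whether this works end to end depends on the Serve endpoints implementing the full KServe v2 request/response schema, not just the URL paths.

```python
# Using NVIDIA's tritonclient HTTP client against a Serve endpoint that speaks
# the KServe v2 protocol. Model and tensor names are illustrative only.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # Ray Serve's default HTTP port

inp = httpclient.InferInput("INPUT0", [1, 3], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3), dtype=np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```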

Thanks