How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello! I am quite new to Ray Serve and I am checking whether it would work for us. We want to serve a number of Triton models using Ray Serve. I have seen the examples offered in the documentation, but those assume the input/output protocol is plain HTTP with JSON bodies. Is there a way to use a standard HTTP inference protocol instead? The idea would be to receive requests in that format and simply forward them to an in-process Triton server.
Thanks!
Some of these protocols should be achievable with the Ray Serve FastAPI integration. For example, you could expose FastAPI endpoints in your Serve deployment that respond to requests at the inference path, v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer.
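Here is a rough sketch of what that could look like (hedged: the class name, the run_model placeholder, and the response shape are illustrative only, not a complete Triton integration):

```python
from fastapi import FastAPI, Request
from ray import serve

app = FastAPI()


def run_model(model_name: str, payload: dict) -> list:
    # Placeholder for forwarding the request to your in-process Triton server;
    # here it simply echoes the request's inputs back as outputs.
    return payload.get("inputs", [])


@serve.deployment
@serve.ingress(app)
class V2InferenceServer:
    @app.post("/v2/models/{model_name}/infer")
    async def infer(self, model_name: str, request: Request):
        # KServe-V2-style request body: {"inputs": [...], "parameters": {...}, ...}
        payload = await request.json()
        outputs = run_model(model_name, payload)
        # Return a KServe-V2-shaped response body.
        return {"model_name": model_name, "outputs": outputs}


serve.run(V2InferenceServer.bind())
```

The optional /versions/${MODEL_VERSION} segment could be handled with a second route in the same way.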
You could also implement the health-check endpoints the same way, but the challenge is that these endpoints will only run on a single replica. So the health check endpoint won’t check the full deployment, just one replica. One workaround might be to make the health-check endpoint send a status request to the Ray dashboard and return the health results from there.
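Continuing the sketch above, that workaround could look roughly like this. It assumes the Ray dashboard is reachable at its default address (localhost:8265); the exact response schema of its Serve REST API may vary between Ray versions, so treat the status parsing as illustrative:

```python
import httpx  # assumption: any HTTP client works here

RAY_DASHBOARD_URL = "http://localhost:8265"  # default Ray dashboard address


@serve.deployment
@serve.ingress(app)
class V2InferenceServer:
    # ... infer route from the previous sketch ...

    @app.get("/v2/health/ready")
    async def ready(self):
        # Instead of reporting only this replica's health, ask the Ray dashboard's
        # Serve REST API for the status of the whole Serve instance.
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"{RAY_DASHBOARD_URL}/api/serve/applications/")
        apps = resp.json().get("applications", {}) if resp.status_code == 200 else {}
        healthy = bool(apps) and all(
            app_info.get("status") == "RUNNING" for app_info in apps.values()
        )
        return {"ready": healthy}
```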
Hello,
Thanks for your quick answer. Apart from the endpoint itself, do you know if it is possible to use an already existing KServe inference protocol client with Ray Serve, such as the ones offered by NVIDIA?
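For illustration, I mean something like this (a hypothetical sketch using NVIDIA's tritonclient pointed at the Ray Serve HTTP address; the model and tensor names are made up):

```python
import numpy as np
import tritonclient.http as httpclient

# Point the standard Triton/KServe HTTP client at Ray Serve's HTTP proxy
# (assumed here to be listening on the default localhost:8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical model and tensor names, just to show the call shape.
inp = httpclient.InferInput("INPUT__0", [1, 3], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3), dtype=np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT__0"))
```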
Thanks