How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
```
from ray import serve
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@serve.deployment(route_prefix="/", ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class ChatbotModelDeployment:
    # _infer is a generator that yields tokens from the model's stream_chat()
    # (its definition is shown in the GitHub issue quoted below).

    @app.post("/")
    async def query(self, request: Request):
        data = await request.json()
        query = data.get("query", "")
        max_length = data.get("max_length", 2048)
        top_p = data.get("top_p", 0.9)
        temperature = data.get("temperature", 0.7)
        use_stream_chat = data.get("use_stream_chat", True)
        output = self._infer(query, None, max_length, top_p,
                             temperature, use_stream_chat)
        return StreamingResponse(output, media_type="text/plain")

chatbot_model_deployment = ChatbotModelDeployment.bind()
```
I am building a simple NLP model server with Ray Serve + FastAPI; however, I could not get it to stream tokens back one by one (the output arrives as a whole paragraph all at once). Any ideas on how to fix that? Many thanks in advance.
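For reference, a minimal client-side sketch (assuming the Serve app is reachable at the default http://localhost:8000/) that checks whether tokens arrive incrementally:
```
import requests

# Ask requests not to buffer the whole response body.
resp = requests.post(
    "http://localhost:8000/",
    json={"query": "Hello", "max_length": 2048, "top_p": 0.9,
          "temperature": 0.7, "use_stream_chat": True},
    stream=True,
)
for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
    # With true streaming, each generated token should appear here one by one;
    # right now the whole paragraph arrives as a single chunk instead.
    print(chunk, end="", flush=True)
```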
Ray Serve does not support streaming responses yet. Please file a feature request here - Issues · ray-project/ray · GitHub
@jiaanguo Since this functionality is not currently supported, can you file an issue as requested by @Akshay_Malik? Then we can close this as resolved.
If you do file an issue, please include a link to it here. Thanks!
GitHub issue opened 11 Apr 2023, 09:35 UTC (labels: enhancement, triage, serve):
### Description
I am building a chatbot with Ray serving my model; the model already supports streaming output, generating word tokens one after another. However, I could not get Ray to send tokens back immediately in a streaming manner. This feature is greatly needed!
```
from ray import serve
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@serve.deployment(route_prefix="/", ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class ChatbotModelDeployment:
    def _infer(...):
        # Generator: yield each token back as soon as the model produces it.
        for output in self._model.stream_chat(...):
            yield output + '\n'

    @app.post("/")
    async def query(self, request: Request):
        data = await request.json()
        query = data.get("query", "")
        output = self._infer(query, ...)
        return StreamingResponse(output, media_type="text/plain")

chatbot_model_deployment = ChatbotModelDeployment.bind()
```
Something like the above, which returns a StreamingResponse from FastAPI, is what I am after. Currently, even when I try using StreamingResponse, it still responds with the whole paragraph only after all tokens have been generated.
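For comparison, plain FastAPI already flushes chunks to the client as a generator yields them; a minimal sketch of the desired behavior (the fake_token_stream generator below is a dummy stand-in for a real model's stream_chat):
```
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream(query: str):
    # Dummy stand-in for a model's streaming generation: one "token" at a time.
    for token in f"echo: {query}".split():
        yield token + "\n"
        await asyncio.sleep(0.1)  # simulate per-token generation latency

@app.post("/")
async def query_endpoint(payload: dict):
    # Each yielded chunk is sent to the client as soon as it is produced.
    return StreamingResponse(fake_token_stream(payload.get("query", "")),
                             media_type="text/plain")
```
Getting this same chunk-by-chunk behavior when the endpoint lives behind serve.ingress is what this request is about.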
### Use case
Especially in NLP model deployment, this feature is needed to send tokens back as they are generated rather than waiting for the whole paragraph to be produced.
I have created a new issue on GitHub; the link is above.
You can close this now, many thanks.