How to get streaming output working?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

from ray import serve

app = FastAPI()

@serve.deployment(route_prefix="/", ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class ChatbotModelDeployment:

    @app.post("/")
    async def query(self, request: Request):
        data = await request.json()
        query = data.get("query", "")
        max_length = data.get("max_length", 2048)
        top_p = data.get("top_p", 0.9)
        temperature = data.get("temperature", 0.7)
        use_stream_chat = data.get("use_stream_chat", True)

        # _infer (defined elsewhere in my code) returns a generator
        # that yields tokens as they are produced.
        output = self._infer(query, None, max_length, top_p,
                             temperature, use_stream_chat)
        return StreamingResponse(output, media_type="text/plain")

chatbot_model_deployment = ChatbotModelDeployment.bind()

I am building a simple NLP model server with Ray Serve + FastAPI, but I cannot get it to stream tokens one by one: the response arrives as one whole paragraph all at once. Any ideas on how to fix this? Many thanks in advance.
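
For context, StreamingResponse only streams incrementally when it is given an iterator or async generator that yields chunks as they are produced; a handler that builds the full reply and returns it in one piece will always show up as a single paragraph. Below is a minimal sketch of the generator shape _infer is meant to have, shown as a free function (infer_stream) with a hypothetical per-token loop in place of the real model call:

import asyncio
from typing import AsyncGenerator

async def infer_stream(query: str) -> AsyncGenerator[str, None]:
    # Hypothetical stand-in for the real model call: yield each token
    # as soon as it is generated rather than returning the full reply.
    for token in query.split():
        await asyncio.sleep(0.05)  # simulate per-token generation latency
        yield token + " "

async def main():
    # Consuming the generator prints tokens incrementally.
    async for chunk in infer_stream("hello streaming world"):
        print(chunk, end="", flush=True)

asyncio.run(main())

Even with a generator like this, the chunks only reach the client one by one if nothing between the handler and the HTTP socket buffers the response.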

Ray Serve does not support streaming responses yet. Please file a feature request here - Issues · ray-project/ray · GitHub
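
In the meantime, one workaround is to serve the FastAPI app directly under uvicorn, where StreamingResponse does stream because Serve's HTTP proxy is not in between. A minimal, self-contained sketch, with a hypothetical fake_tokens generator standing in for the model:

import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

async def fake_tokens():
    # Hypothetical token source; the delay makes incremental
    # delivery visible on the client side.
    for token in ["streaming ", "works ", "here\n"]:
        await asyncio.sleep(0.2)
        yield token

@app.post("/")
async def query():
    # Handing the generator to StreamingResponse flushes each
    # chunk to the client as soon as it is yielded.
    return StreamingResponse(fake_tokens(), media_type="text/plain")

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)

Querying it with curl -N -X POST http://127.0.0.1:8000/ (the -N flag disables curl's output buffering) should show the chunks arriving one at a time.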

@jiaanguo Since this functionality is not currently supported, can you file an issue as requested by @Akshay_Malik? Then we can close this as resolved.

If you file an issue, please include a link to it here. Thanks!

I have created a new issue on GitHub; the link is above.

You can close this now, many thanks.