How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
```
from ray import serve
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@serve.deployment(route_prefix="/", ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class ChatbotModelDeployment:
    # _infer is a generator that yields tokens from the model's stream_chat()
    # (its definition is shown in the GitHub issue quoted below).

    @app.post("/")
    async def query(self, request: Request):
        data = await request.json()
        query = data.get("query", "")
        max_length = data.get("max_length", 2048)
        top_p = data.get("top_p", 0.9)
        temperature = data.get("temperature", 0.7)
        use_stream_chat = data.get("use_stream_chat", True)
        output = self._infer(query, None, max_length, top_p,
                             temperature, use_stream_chat)
        return StreamingResponse(output, media_type="text/plain")

chatbot_model_deployment = ChatbotModelDeployment.bind()
```
I am building a simple NLP model server with Ray Serve + FastAPI; however, I could not get it to stream tokens back one by one (the output arrives as a whole paragraph all at once). Any ideas on how to fix that? Many thanks in advance.
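For reference, a minimal client-side sketch (assuming the Serve app is reachable at the default http://localhost:8000/) that checks whether tokens arrive incrementally:
```
import requests

# Ask requests not to buffer the whole response body.
resp = requests.post(
    "http://localhost:8000/",
    json={"query": "Hello", "max_length": 2048, "top_p": 0.9,
          "temperature": 0.7, "use_stream_chat": True},
    stream=True,
)
for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
    # With true streaming, each generated token should appear here one by one;
    # right now the whole paragraph arrives as a single chunk instead.
    print(chunk, end="", flush=True)
```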
Ray Serve does not support streaming responses yet. Please file a feature request here - Issues · ray-project/ray · GitHub
@jiaanguo Since this functionality is not currently supported, can you file an issue as requested by @Akshay_Malik? Then we can close this as resolved.
If you do file an issue, please include a link to it here. Thanks!
GitHub issue opened 11 Apr 2023, 09:35 UTC (labels: enhancement, triage, serve):
### Description
I am building a chatbot with Ray serving my model; the model already supports streaming output, generating word tokens one after another. However, I could not get Ray to send tokens back immediately in a streaming manner. This feature is greatly needed!
```
from ray import serve
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@serve.deployment(route_prefix="/", ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class ChatbotModelDeployment:
    def _infer(...):
        # Generator: yield each token back as soon as the model produces it.
        for output in self._model.stream_chat(...):
            yield output + '\n'

    @app.post("/")
    async def query(self, request: Request):
        data = await request.json()
        query = data.get("query", "")
        output = self._infer(query, ...)
        return StreamingResponse(output, media_type="text/plain")

chatbot_model_deployment = ChatbotModelDeployment.bind()
```
Something like the above, which returns a StreamingResponse from FastAPI, is what I am after. Currently, even when I try using StreamingResponse, it still responds with the whole paragraph only after all tokens have been generated.
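For comparison, plain FastAPI already flushes chunks to the client as a generator yields them; a minimal sketch of the desired behavior (the fake_token_stream generator below is a dummy stand-in for a real model's stream_chat):
```
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream(query: str):
    # Dummy stand-in for a model's streaming generation: one "token" at a time.
    for token in f"echo: {query}".split():
        yield token + "\n"
        await asyncio.sleep(0.1)  # simulate per-token generation latency

@app.post("/")
async def query_endpoint(payload: dict):
    # Each yielded chunk is sent to the client as soon as it is produced.
    return StreamingResponse(fake_token_stream(payload.get("query", "")),
                             media_type="text/plain")
```
Getting this same chunk-by-chunk behavior when the endpoint lives behind serve.ingress is what this request is about.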
### Use case
Especially in NLP model deployment, this feature is needed to send tokens back as they are generated rather than waiting for the whole paragraph to be produced.
I have created a new issue on GitHub; the link is above.
You can close this now, many thanks.