Hi Seiji,
Thank you for the pointer!
I was able to get the config.yaml working by generating it with the following command:
serve build image_classifier:app text_translator:app llm_app:app -o config.yaml
I also confirmed that removing the deployments section, as you pointed out above, works as well.
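For context, the generated config looks roughly like this once I remove the deployments section (trimmed and written from memory, so the exact fields and route prefixes may differ slightly):

proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 8000
applications:
- name: app1
  route_prefix: /app1
  import_path: image_classifier:app
- name: app2
  route_prefix: /app2
  import_path: text_translator:app
- name: app3
  route_prefix: /app3
  import_path: llm_app:app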
However, when I attempted to send requests between applications as described in the Ray docs example (Deploy Multiple Applications — Ray 2.46.0), I ran into the following error:
AttributeError: 'dict' object has no attribute 'stream'
The same request made with the OpenAI Python client works fine when I deploy and call the LLM app directly, as shown in this Ray docs guide (Serving LLMs — Ray 2.46.0), without setting the stream=True parameter. I assume the chat completion call is non-streaming unless stream=True is explicitly set.
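For reference, the direct call that works (adapted from that guide; the api_key is just a placeholder the OpenAI client requires) looks roughly like this:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="fake-key",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
)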
I tried setting both stream=False and stream=True explicitly in test.py, but neither worked.
How can I make the same OpenAI Python client call work, with either streaming or non-streaming, when using multiple apps and sending requests between applications?
Your help is greatly appreciated!
For your reference:
# llm_app.py
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="meta-llama/Llama-3.2-3B-Instruct",
        model_source="meta-llama/Llama-3.2-3B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="AMD-Instinct-MI250X-MI250",
    engine_kwargs=dict(
        tensor_parallel_size=1,
    ),
)

deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
app = LLMRouter.as_deployment().bind([deployment])
# image_classifier.py (excerpt)
import starlette.requests
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class ImageClassifier:
    async def __call__(self, req: starlette.requests.Request):
        req = await req.json()
        if req["model"] in ["meta-llama/Llama-3.2-3B-Instruct"]:
            handle: DeploymentHandle = serve.get_app_handle("app3")
            return await handle.chat.remote(req)
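The only workaround I can think of is to skip the deployment handle and call app3's HTTP endpoint from inside ImageClassifier, along the lines of the untested sketch below (I'm assuming the OpenAI-compatible route is mounted under app3's /app3 prefix), but I'd much rather keep using handles the way the multi-app docs describe:

import httpx

async def forward_via_http(req: dict):
    # Untested sketch: POST the raw JSON body to app3's OpenAI-compatible
    # chat completions route instead of using serve.get_app_handle("app3").
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            "http://localhost:8000/app3/v1/chat/completions",
            json=req,
        )
    return resp.json()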
# test.py
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/app1",
    api_key="fake-key",  # dummy value; the OpenAI client requires one
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    # stream=True,
)
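Full traceback from the Serve logs: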
ray.exceptions.RayTaskError(AttributeError): ray::ServeReplica:app3:LLMRouter.handle_request_with_rejection() (pid=1612130, ip=10.42.3.40, actor_id=c64c30b9729717d4179c50a553000000, repr=<ray.serve._private.replica.ServeReplica:app3:LLMRouter object at 0x7fb023a67ce0>)
async for result in self._replica_impl.handle_request_with_rejection(
File "/usr/local/lib/python3.12/dist-packages/ray/serve/_private/replica.py", line 656, in handle_request_with_rejection
yield await asyncio.wrap_future(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/serve/_private/replica.py", line 1610, in call_user_method
result, sync_gen_consumed = await self._call_func_or_gen(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/serve/_private/replica.py", line 1328, in _call_func_or_gen
result = await result
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/llm/_internal/serve/deployments/routers/router.py", line 392, in chat
if body.stream:
^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'stream'
INFO 2025-05-20 17:57:12,312 app1_ImageClassifier zeo46uyj fffeb9ce-fc23-47c2-933a-00a77138eff1 -- POST / 500 17.7ms