How to route traffic to LiteLLM models using Serving LLMs

We’re currently serving vLLM models via the ‘Serving LLMs’ API (Serving LLMs — Ray 2.46.0).

Is there a way or an example to use Ray Serve to pass through LiteLLM models?

We’re looking at the ‘Deploy Compositions of Models’ and ‘Deploy Multiple Applications’ sections, but I’m unsure which, if any, is the recommended approach for adding a LiteLLM pass-through alongside Serving LLMs.
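
For context, the kind of pass-through we have in mind is roughly the following. This is an untested sketch, not our current code: it assumes litellm's acompletion API, the default model name is purely illustrative, and the response conversion is an assumption.

# litellm_passthrough.py (hypothetical sketch)
import litellm

from ray import serve
from starlette.requests import Request


@serve.deployment
class LiteLLMPassthrough:
    def __init__(self, default_model: str = "openai/gpt-4o-mini"):  # model name is illustrative
        self.default_model = default_model

    async def __call__(self, req: Request):
        body = await req.json()
        # Forward the chat request to LiteLLM and relay the provider's response.
        response = await litellm.acompletion(
            model=body.get("model", self.default_model),
            messages=body["messages"],
        )
        # litellm's ModelResponse is pydantic-like; model_dump() assumed here.
        return response.model_dump()


app = LiteLLMPassthrough.bind()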

Hi James!
What LiteLLM model are you trying to serve right now, and do you have any code snippets showing what your current setup looks like? What issues are you running into?

Hi Christina,

More generally, how can we run the Serving LLMs app (Serving LLMs — Ray 2.46.0) alongside other applications described in the Deploy Multiple Applications section (Deploy Multiple Applications — Ray 2.46.0), so that requests can also be routed to non-LLM apps?

For example, how can we include serve_llm_config.yaml as another app inside multi_app_config.yaml? We tried, but couldn’t get it to work.

Thank you!

# serve_llm_config.yaml
applications:
- args:
    llm_configs:
        - model_loading_config:
            model_id: qwen-0.5b
            model_source: Qwen/Qwen2.5-0.5B-Instruct
          accelerator_type: A10G
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 2
        - model_loading_config:
            model_id: qwen-1.5b
            model_source: Qwen/Qwen2.5-1.5B-Instruct
          accelerator_type: A10G
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 2
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
# multi_app_config.yaml
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

grpc_options:
  port: 9000
  grpc_servicer_functions: []

logging_config:
  encoding: JSON
  log_level: INFO
  logs_dir: null
  enable_access_log: true

applications:
  - name: app1
    route_prefix: /classify
    import_path: image_classifier:app
    runtime_env: {}
    deployments:
      - name: downloader
      - name: ImageClassifier

  - name: app2
    route_prefix: /translate
    import_path: text_translator:app
    runtime_env: {}
    deployments:
      - name: Translator

Hi @James_Wong, setting up multiple apps should be possible under a different route_prefix for each app. Can you share the config you tried that doesn’t work?
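
For reference, the application that ray.serve.llm:build_openai_app builds from those llm_configs args can also be constructed in Python, which may make it easier to experiment with running it under its own route_prefix next to the other apps. A rough sketch, assuming build_openai_app accepts the same llm_configs dict that the YAML passes as args:

# llm_app_builder.py (sketch; argument shape assumed from serve_llm_config.yaml above)
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    accelerator_type="A10G",
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
)

# Same app that `import_path: ray.serve.llm:build_openai_app` builds from the YAML args.
app = build_openai_app({"llm_configs": [llm_config]})

# Give the LLM app its own name and prefix so it can run next to the other apps.
serve.run(app, name="llm_app", route_prefix="/llm", blocking=False)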

Hi Akshay,

Here are the examples and the config.yaml I’m using. Thanks!

# llm_app.py
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=2,
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
app = LLMRouter.as_deployment().bind([deployment])
# image_classifier.py
import requests
import starlette

from transformers import pipeline
from io import BytesIO
from PIL import Image

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
def downloader(image_url: str):
    image_bytes = requests.get(image_url).content
    image = Image.open(BytesIO(image_bytes)).convert("RGB")
    return image


@serve.deployment
class ImageClassifier:
    def __init__(self, downloader: DeploymentHandle):
        self.downloader = downloader
        self.model = pipeline(
            "image-classification", model="google/vit-base-patch16-224"
        )

    async def classify(self, image_url: str) -> str:
        image = await self.downloader.remote(image_url)
        results = self.model(image)
        return results[0]["label"]

    async def __call__(self, req: starlette.requests.Request):
        req = await req.json()
        result = await self.classify(req["image_url"])
        if req.get("model") is True:
            print("Using app3")
            handle: DeploymentHandle = serve.get_app_handle("app3")
            return await handle.translate.remote(result)
        print("Using app1")
        if req.get("should_translate") is True:
            handle: DeploymentHandle = serve.get_app_handle("app2")
            return await handle.translate.remote(result)


app = ImageClassifier.bind(downloader.bind())
# text_translator.py
import starlette

from transformers import pipeline

from ray import serve


@serve.deployment
class Translator:
    def __init__(self):
        self.model = pipeline("translation_en_to_de", model="t5-small")

    def translate(self, text: str) -> str:
        return self.model(text)[0]["translation_text"]

    async def __call__(self, req: starlette.requests.Request):
        req = await req.json()
        return self.translate(req["text"])


app = Translator.bind()
# config.yaml
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 8000
grpc_options:
  port: 9000
  grpc_servicer_functions: []
logging_config:
  encoding: TEXT
  log_level: INFO
  logs_dir: null
  enable_access_log: true
applications:
- name: app1
  route_prefix: /app1
  import_path: image_classifier:app
  runtime_env: {}
  deployments:
  - name: downloader
  - name: ImageClassifier
- name: app2
  route_prefix: /app2
  import_path: text_translator:app
  runtime_env: {}
  deployments:
  - name: Translator
- name: app3
  route_prefix: /app3
  import_path: llm_app:app
  runtime_env: {}
  deployments:
  - name: vLLM
ray job submit --address=http://localhost:8265 \
  --working-dir ./ \
  -- serve run --non-blocking config.yaml

Got the following error:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/_private/application_state.py", line 679, in _reconcile_build_app_task
    overrided_infos = override_deployment_info(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/_private/application_state.py", line 1228, in override_deployment_info
    raise ValueError(
ValueError: Got config override for nonexistent deployment 'vLLM'

Hi @James_Wong! Thanks for the update with the code.

Can you try removing the lines

  deployments:
  - name: vLLM

from config.yaml and submitting the job again?
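
For background: the deployments created by LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")) presumably don't end up literally named 'vLLM' (the prefix produces something like 'vLLM:qwen-0.5b', plus the LLMRouter deployment), which is why the override fails. If you do want per-deployment overrides later, one way to check the real names is a small script like this (sketch; run it on the cluster after the apps are deployed):

# list_deployments.py (sketch) -- print the deployment names Serve actually created,
# so config overrides can reference them exactly.
import ray
from ray import serve

ray.init(address="auto")  # connect to the running cluster

status = serve.status()
for app_name, app_status in status.applications.items():
    print(app_name, "->", list(app_status.deployments.keys()))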

Hi Seiji,

Thank you for the pointer!

I was able to get the config.yaml working by generating it using the following command:

serve build image_classifier:app text_translator:app llm_app:app -o config.yaml

I also confirmed that removing the deployments section, as you pointed out above, works as well.

However, when I attempted to send requests between applications as described in the Ray docs example (Deploy Multiple Applications — Ray 2.46.0), I encountered the following error:

AttributeError: 'dict' object has no attribute 'stream'

The same request via the OpenAI Python client works fine when I deploy and call the LLM app directly, as shown in the Ray docs guide (Serving LLMs — Ray 2.46.0), without setting the stream parameter. I assume the chat completion call is non-streaming unless stream=True is explicitly set.
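
For reference, the direct call that works looks roughly like this (the base_url and api_key values are the usual placeholders for the standalone LLM app mounted at /v1; adjust to your setup):

# direct_client.py (sketch of the direct, working call)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Non-streaming: the full completion comes back in one response object.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(resp.choices[0].message.content)

# Streaming: iterate over chunks as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")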

I tried setting both stream=False and stream=True explicitly in test.py, but neither worked.

How can I make the same OpenAI Python client call work, with either streaming or non-streaming, when using multiple apps and sending requests between applications?

Your help is greatly appreciated!

For your reference:

# llm_app.py
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="meta-llama/Llama-3.2-3B-Instruct",
        model_source="meta-llama/Llama-3.2-3B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="AMD-Instinct-MI250X-MI250",
    engine_kwargs=dict(
        tensor_parallel_size=1,
    ),
)

deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
app = LLMRouter.as_deployment().bind([deployment])
# image_classifier.py (excerpt)
    async def __call__(self, req: starlette.requests.Request):
        req = await req.json()
        
        if req["model"] in ["meta-llama/Llama-3.2-3B-Instruct"]:
            handle: DeploymentHandle = serve.get_app_handle("app3")
            return await handle.chat.remote(req)
# test.py
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/app1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    # stream=True,
)

And the resulting error:

ray.exceptions.RayTaskError(AttributeError): ray::ServeReplica:app3:LLMRouter.handle_request_with_rejection() (pid=1612130, ip=10.42.3.40, actor_id=c64c30b9729717d4179c50a553000000, repr=<ray.serve._private.replica.ServeReplica:app3:LLMRouter object at 0x7fb023a67ce0>)
  async for result in self._replica_impl.handle_request_with_rejection(
  File "/usr/local/lib/python3.12/dist-packages/ray/serve/_private/replica.py", line 656, in handle_request_with_rejection
    yield await asyncio.wrap_future(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/serve/_private/replica.py", line 1610, in call_user_method
    result, sync_gen_consumed = await self._call_func_or_gen(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/serve/_private/replica.py", line 1328, in _call_func_or_gen
    result = await result
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/llm/_internal/serve/deployments/routers/router.py", line 392, in chat
    if body.stream:
       ^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'stream'
INFO 2025-05-20 17:57:12,312 app1_ImageClassifier zeo46uyj fffeb9ce-fc23-47c2-933a-00a77138eff1 -- POST / 500 17.7ms

No problem, thanks for reaching out @James_Wong. Seems like here we have

req: dict = await req.json()

But handle.chat.remote() is expecting a ChatCompletionRequest. Can we try the following instead?

# image_classifier.py
from ray.serve.llm.openai_api_models import ChatCompletionRequest


@serve.deployment
class ImageClassifier:
    ...
    async def __call__(self, req: starlette.requests.Request):
        req = await req.json()
        result = await self.classify(req["image_url"]) # Assuming this based on earlier code
        if req["model"] in ["meta-llama/Llama-3.2-3B-Instruct"]:
            handle: DeploymentHandle = serve.get_app_handle("app3")

            request = ChatCompletionRequest(
                model="meta-llama/Llama-3.2-3B-Instruct",
                messages=[
                    {
                        "role": "user",
                        "content": result
                    }
                ]
            )
            # `remote()` on a DeploymentHandle returns an awaitable DeploymentResponse,
            # so await it here instead of blocking with ray.get().
            return await handle.chat.remote(request)
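
Once that's in, a quick way to exercise the composed route is a plain HTTP POST to /app1 rather than the OpenAI client, since app1's __call__ reads image_url and model from the JSON body (sketch; the image URL is a placeholder):

# test_composed.py (sketch)
import requests

resp = requests.post(
    "http://localhost:8000/app1",
    json={
        "image_url": "https://example.com/some-image.jpg",  # placeholder
        "model": "meta-llama/Llama-3.2-3B-Instruct",
    },
)
print(resp.status_code, resp.text)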

EDIT: typo