Do multiplexing and batching work together?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Can multiplexing and batching be combined? I looked through the documentation and couldn't find any examples.

Hello there! I believe you can combine multiplexing and dynamic request batching in Ray Serve. The official docs don’t have an example of them together, but I think they can be used in the same deployment.

To get it working, you'd apply the @serve.multiplexed decorator to the model-loading method and the @serve.batch decorator to the request-handling method of the same deployment; a rough sketch follows the doc links below.

Multiplexed example: Model Multiplexing — Ray 2.46.0

Dynamic request batching example: Dynamic Request Batching — Ray 2.46.0
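
Here's a minimal sketch of what that combination might look like, based on the serve.multiplexed, serve.batch, and serve.get_multiplexed_model_id APIs from those docs; load_model_from_store is a hypothetical helper standing in for your own model-loading code:

```python
from ray import serve
from starlette.requests import Request


@serve.deployment
class MultiModel:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Load and cache the model for this ID (LRU-evicted beyond 3 models).
        return load_model_from_store(model_id)  # hypothetical loader

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def batched_predict(self, inputs: list) -> list:
        # Read the model ID for the current request from the Serve
        # request context, then run the whole batch through that model.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return [model(x) for x in inputs]

    async def __call__(self, request: Request):
        # The @serve.batch decorator collects individual calls into batches.
        return await self.batched_predict(await request.json())


app = MultiModel.bind()
```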

Great, thanks! I'll give it a try.

Hi @christina,

Batching and multiplexing don't seem to work well together. I'm observing failures from both features when they're implemented together:

  1. Multiplexing gets stuck on the first few loaded models

With the multiplexing max models set to 3 and batching enabled per the params, model_id = serve.get_multiplexed_model_id() just fetches the first loaded model's ID and uses it to run inference for all requests instead of multiplexing.

Multiplexing fails when combined with the batching system.

  2. Batching is not based on multiplexed model IDs

Request batching does not seem to be grouped by multiplexed model ID, or it may just be failing because of the first issue.
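
For illustration, here is a minimal client-side check, assuming the deployment sketched above is running at http://localhost:8000 and the standard serve_multiplexed_model_id header; under the behavior described, every response comes from the first loaded model:

```python
import requests

# Tag each request with a different multiplexed model ID; with the bug,
# all three responses are produced by whichever model was loaded first.
for model_id in ["model_a", "model_b", "model_c"]:
    resp = requests.post(
        "http://localhost:8000/",
        json=[1.0, 2.0, 3.0],
        headers={"serve_multiplexed_model_id": model_id},
    )
    print(model_id, "->", resp.json())
```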

Others seem to have observed a similar issue as well.

It's very disheartening that such a great feature combination is missing from the system!

Thank you

You are correct: as of Ray 2.48.0, batching and multiplexing do not work together as expected. When both are enabled, batched requests are not grouped by multiplexed model ID, and the batching context does not properly propagate the correct model ID to each request in the batch. This results in all batched requests using the first loaded model, regardless of the intended model ID, as described in your logs and confirmed by the open GitHub issue [Serve] model multiplexing and batching does not work together.

This is a known limitation, and the Ray team is aware of it. There is ongoing discussion and a related PR to improve request context handling in batches, but as of now, batching is not aware of multiplexed model IDs and does not guarantee that requests for different models end up in different batches ([Serve] group requests by model_id in Model Multiplexing, [Serve] Allow access to request context of each request in the batch). Would you like a step-by-step breakdown of the technical reasons and possible workarounds?
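
In the meantime, one possible workaround (a sketch under assumptions, not an official fix) is to read serve.get_multiplexed_model_id() in __call__, before batching, where the per-request context is still intact, and pass the ID into the batched method explicitly so the batch can be grouped by model manually; load_model_from_store is again a hypothetical loader:

```python
from collections import defaultdict

from ray import serve
from starlette.requests import Request


@serve.deployment
class GroupedBatching:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        return load_model_from_store(model_id)  # hypothetical loader

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def batched_predict(self, items: list) -> list:
        # Each item is a (model_id, payload) tuple packed by __call__,
        # so the batch can be partitioned by model ID by hand.
        groups = defaultdict(list)
        for idx, (model_id, _) in enumerate(items):
            groups[model_id].append(idx)
        results = [None] * len(items)
        for model_id, indices in groups.items():
            model = await self.get_model(model_id)
            for idx in indices:
                results[idx] = model(items[idx][1])
        return results

    async def __call__(self, request: Request):
        # Read the model ID here, outside the batched method, where the
        # request context still belongs to this individual request.
        model_id = serve.get_multiplexed_model_id()
        return await self.batched_predict((model_id, await request.json()))


app = GroupedBatching.bind()
```

This trades some batching efficiency (a single batch may mix models) for correct per-request routing until the upstream fix lands.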

