LLM Batch Model loading using runai_streamer is very slow

1. Severity of the issue: (select one)
  • None: I’m just curious or want clarification.

  • Low: Annoying but doesn’t hinder my work.

  • Medium: Significantly affects my productivity but can find a workaround.

  • [V] High: Completely blocks me.

2. Environment:

  • Ray version: 2.48.0

  • Python version: 3.11

  • OS: Amazon Linux 2023

  • Cloud/Infrastructure: AWS

  • Other libs/tools (if relevant): vLLM 0.10.0

I’m hosting Hugging Face LLMs on S3. When I launch one of them with vllm serve and the load-format flag set to runai_streamer, it works as expected and I see very fast model loading times. However, when I use the Ray Data example to run the same model config on a Ray cluster for offline batch inference, the model download is horrendously slow. It seems to download the model to disk first before loading it into the GPU, and even the download to disk is much slower than expected. For reference, I’m following this example with engine_kwargs={"load_format": "runai_streamer"}:
Working with LLMs — Ray 2.48.0
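
For what it’s worth, here is a rough sketch of what I’m running, adapted from that example. The S3 path, prompts, and sampling parameters below are placeholders, but the engine_kwargs entry is the same one I’m using:

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Placeholder model path; in my setup this points at the same S3 location
# that works with "vllm serve --load-format runai_streamer".
config = vLLMEngineProcessorConfig(
    model_source="s3://my-bucket/models/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "load_format": "runai_streamer",
        "max_model_len": 8192,
    },
    concurrency=1,   # also tried higher values, made no difference
    batch_size=64,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

# Minimal dataset just to trigger engine startup and model loading.
ds = ray.data.from_items([{"prompt": "Hello, world"}])
ds = processor(ds)
ds.show()
```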

Does anyone know why this happens, or whether I’m missing something? I’ve also tried adjusting the concurrency setting, which didn’t help at all.