LLM Batch Model loading using runai_streamer is very slow

1. Severity of the issue: (select one)
  • None: I’m just curious or want clarification.

  • Low: Annoying but doesn’t hinder my work.

  • Medium: Significantly affects my productivity but can find a workaround.

  • [V] High: Completely blocks me.

2. Environment:

  • Ray version: 2.48.0

  • Python version: 3.11

  • OS: Amazon Linux 2023

  • Cloud/Infrastructure: AWS

  • Other libs/tools (if relevant): vLLM 0.10.0

I’m hosting Hugging Face LLMs on S3. When I launch one of them with vllm serve and the load-format flag set to runai_streamer, it works as expected and I see very fast model loading times. However, when I use the Ray Data example to run the same model config on a Ray cluster for offline batch inference, the model download is horrendously slow. It seems to download the model to disk first before loading it into the GPU, and even the download to disk is much slower than expected. For reference, I’m following this example with engine_kwargs={"load_format": "runai_streamer"}:
Working with LLMs — Ray 2.48.0
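
For what it’s worth, here is a rough sketch of what I’m running, adapted from that example. The S3 path, prompts, and sampling parameters below are placeholders, but the engine_kwargs entry is the same one I’m using:

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Placeholder model path; in my setup this points at the same S3 location
# that works with "vllm serve --load-format runai_streamer".
config = vLLMEngineProcessorConfig(
    model_source="s3://my-bucket/models/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "load_format": "runai_streamer",
        "max_model_len": 8192,
    },
    concurrency=1,   # also tried higher values, made no difference
    batch_size=64,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

# Minimal dataset just to trigger engine startup and model loading.
ds = ray.data.from_items([{"prompt": "Hello, world"}])
ds = processor(ds)
ds.show()
```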

Does anyone know why this happens, or whether I’m missing something? I’ve also tried adjusting the concurrency setting, which didn’t help at all.