I need to run inference with vLLM over a large dataset. The code structure is as below:
import ray

# Read all parquet files under the S3 prefix into a Ray Dataset
ds = ray.data.read_parquet(my_input_path)

# Run vLLM inference batch by batch with a callable predictor class
ds = ds.map_batches(
    LLMPredictor,
    concurrency=ray_concurrency,
    ...
    **resources_kwarg,
)

# Write the inference results back out as parquet
ds.write_parquet(my_output_path)
My input data is an S3 path containing many parquet files, each ~10MB.
What I observed is that on each node, the write process starts only after all inference jobs have finished. Is there a way to achieve streaming writes, i.e. write every n batches (a sketch of what I mean is after my questions below)?
The reasons are:
- During inference only the GPUs are working and the CPUs are idle; I don't want to waste the CPU resources during that time.
- If the dataset is large (~100GB), I don't want to hold the whole result in memory, which may cause OOM, and I want to see inference results earlier, as soon as they are generated.
Does Ray support this, and how can I achieve it?
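To make it concrete, this is roughly the chunked workaround I have in mind, just a sketch under my assumptions (the file listing, chunk_size, and the per-chunk output subfolders are placeholders I made up, not tested code):

# Rough sketch of what I mean: split the input files into chunks and run
# read -> map_batches -> write per chunk, so each chunk's results land on S3
# as soon as that chunk finishes instead of waiting for the whole dataset.
import ray

def run_chunked_inference(files, chunk_size=100):
    # `files`: list of parquet file paths under my_input_path
    # `chunk_size`: hypothetical number of files per chunk
    for i in range(0, len(files), chunk_size):
        chunk_files = files[i:i + chunk_size]
        # read_parquet also accepts a list of paths
        ds = ray.data.read_parquet(chunk_files)
        ds = ds.map_batches(
            LLMPredictor,
            concurrency=ray_concurrency,
            **resources_kwarg,
        )
        # write this chunk's results right away
        ds.write_parquet(f"{my_output_path}/chunk_{i // chunk_size}")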
Thank you