How to stream data directly from S3

Aleksei · February 26, 2024, 3:55pm

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

We need to train PyTorch models on large datasets (500 GB to 2 TB).
We plan to use Ray Data + Ray Train, and we are currently designing our solution.
It seems that we need to pipeline the data from S3, because:

Allocating 500 GB to 2 TB of memory to hold the dataset in global storage would be expensive.
Enabling spilling would hurt performance.
S3 throughput can be boosted up to 10 GB/s, which is faster than any disk and comparable to training from memory.

So it looks like training on data streamed directly from S3, without storing the whole dataset on disk, in memory, or in GCS, would be the optimal solution.

Does Ray support this mode?
Can you point me to where to start?

justinvyu · February 29, 2024, 1:35am

Yes! We recommend using Ray Data + Ray Train to stream data directly from S3 and split it across your distributed training workers on the fly.

See this user guide: Data Loading and Preprocessing — Ray 2.9.3

See here for how to load data from s3: Loading Data — Ray 2.9.3
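
For reference, here is a minimal sketch of what that setup can look like. It assumes the data is stored as Parquet files; the bucket path, batch size, epoch count, and worker count are placeholders, and build_trainer is just a hypothetical helper name. The Ray imports sit inside the functions only so that the file can be imported without a running cluster. Treat it as a starting point rather than a tuned configuration.

```python
from typing import Any, Dict

# Hypothetical bucket/prefix; replace with the real location of your files.
S3_PATH = "s3://my-bucket/train-data/"

# Placeholder hyperparameters passed to every training worker.
TRAIN_CONFIG: Dict[str, Any] = {"num_epochs": 1, "batch_size": 256}


def train_loop_per_worker(config: Dict[str, Any]) -> None:
    """Runs on each training worker and consumes that worker's shard."""
    from ray import train

    # Each worker receives its own streaming split of the "train" dataset.
    shard = train.get_dataset_shard("train")

    for _epoch in range(config["num_epochs"]):
        # Batches arrive as dicts mapping column name -> torch.Tensor.
        # Blocks are read from S3 as iteration proceeds, not all up front.
        for batch in shard.iter_torch_batches(batch_size=config["batch_size"]):
            pass  # forward pass, loss, backward pass, optimizer step go here


def build_trainer(path: str = S3_PATH, num_workers: int = 2, use_gpu: bool = False):
    """Wires a Ray Dataset that points at S3 into a TorchTrainer."""
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Only file metadata is inspected here; the rows themselves are read
    # while the training workers iterate over their shards.
    dataset = ray.data.read_parquet(path)

    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config=TRAIN_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        datasets={"train": dataset},
    )
```

Calling build_trainer().fit() on a cluster with access to the bucket would start the job. If your files are not Parquet, the other ray.data.read_* functions described in the Loading Data guide plug into the same place.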

Let me know if you run into any problems!

Aleksei · March 4, 2024, 10:59am

Thank you, it works as expected.
