How to stream data directly from S3

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

We need to train PyTorch models on large datasets (500 GB to 2 TB).
We plan to use Ray Data + Ray Train and are currently designing our solution.
It seems that we need to pipeline the data from S3 because:

  1. Allocating 500 GB to 2 TB of memory to hold the dataset in global storage would be expensive.
  2. Enabling spilling would hurt performance.
  3. S3 throughput can be pushed up to 10 GB/s, which is faster than any disk and comparable to training from memory.

So it looks like training on data streamed directly from S3, without storing the whole dataset on disk, in memory, or in GCS, would be the optimal solution.

Does Ray support this mode?
Can you point me to where to start?

Yes! We recommend using Ray Data + Ray Train to stream data directly from S3 and split it across your distributed training workers on the fly.

See this user guide: Data Loading and Preprocessing — Ray 2.9.3

See here for how to load data from S3: Loading Data — Ray 2.9.3
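
To give a rough idea of the shape, here is a minimal sketch rather than a definitive setup. It assumes the data is stored as Parquet; the bucket path, worker count, batch size, and the `run_epoch` helper are all hypothetical placeholders, and the training step is left as a no-op where your model code would go. The Ray calls used are `ray.data.read_parquet`, `ray.train.get_dataset_shard`, `iter_torch_batches`, and `TorchTrainer`.

```python
from typing import Callable, Iterable

S3_PATH = "s3://my-bucket/train/"  # hypothetical bucket and prefix
NUM_WORKERS = 4                    # hypothetical number of training workers


def run_epoch(batches: Iterable, step: Callable) -> int:
    """Consume a stream of batches one at a time and return how many were seen."""
    seen = 0
    for batch in batches:
        step(batch)  # only the current batch is handed to the training step
        seen += 1
    return seen


def train_loop_per_worker(config: dict) -> None:
    # Imported inside the function, which runs on each training worker.
    import ray.train

    # Each worker receives its own shard of the "train" dataset.
    shard = ray.train.get_dataset_shard("train")
    for _ in range(config["epochs"]):
        batches = shard.iter_torch_batches(batch_size=config["batch_size"])
        # Placeholder step: replace the lambda with forward/backward/optimizer code.
        run_epoch(batches, step=lambda batch: None)


def main() -> None:
    import ray.data
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Assumption: Parquet files under S3_PATH. Other read_* functions exist
    # for other formats. This only defines the dataset, it does not load it all.
    ds = ray.data.read_parquet(S3_PATH)

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 1, "batch_size": 256},
        datasets={"train": ds},
        scaling_config=ScalingConfig(num_workers=NUM_WORKERS, use_gpu=False),
    )
    trainer.fit()

# Call main() from your driver script on a cluster where Ray is installed.
```

The dataset is passed to the trainer through `datasets={"train": ds}`, and each worker pulls batches from its shard as it iterates, so the full dataset should not need to sit in memory or on local disk at once.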

Let me know if you run into any problems!

Thank you, it works as expected.