How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty completing my task, but I can work around it.
We need to train PyTorch models on large datasets (500 GB-2 TB).
We plan to use Ray Data + Ray Train and are currently designing our solution.
It seems that we need to stream (pipeline) data from S3, because:
- Allocating 500 GB-2 TB of memory to hold the dataset in the object store would be expensive.
- Enabling spilling to disk would hurt performance.
- S3 throughput can be pushed up to ~10 GB/s, which is faster than most disks and comparable to reading from memory.
So training on data streamed directly from S3, without storing the whole dataset on disk, in memory, or in GCS, looks like the optimal solution.
Does Ray support this mode?
Can you point me to where to start?
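For context, here is a rough sketch of the setup I have in mind, based on my reading of the Ray Data + Ray Train docs (assuming Ray 2.x; the bucket path, model, batch size, and scaling parameters are placeholders, not our real values):

```python
# Sketch: stream training data from S3 with Ray Data + Ray Train.
# All concrete values (path, model, sizes) are made up for illustration.

def train_loop_per_worker(config):
    # Imports inside the function so it serializes cleanly to workers.
    import torch
    import ray.train
    import ray.train.torch

    # Each training worker receives a shard of the streaming dataset.
    ds_shard = ray.train.get_dataset_shard("train")

    model = ray.train.torch.prepare_model(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(config["num_epochs"]):
        # Batches are streamed: blocks are fetched from S3 on demand
        # rather than materializing the whole dataset in memory.
        for batch in ds_shard.iter_torch_batches(batch_size=1024):
            ...  # forward/backward/step on `batch`


if __name__ == "__main__":
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # read_parquet is lazy; execution streams through the pipeline, so
    # only a window of blocks is resident in the object store at a time.
    train_ds = ray.data.read_parquet("s3://my-bucket/train-data/")

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"num_epochs": 2},
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        datasets={"train": train_ds},
    )
    trainer.fit()
```

Is this the right starting point, or is there a recommended pattern for tuning the streaming window so S3 reads keep up with the GPUs?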