Massive disk usage when using ray.data

Hey Michał, welcome to Ray!

  • Taking a look at tf.distribute.Strategy.experimental_distribute_dataset, it appears to reshard and rebatch the dataset. Since Ray Datasets already shards and batches the dataset under the hood, this extra step is unnecessary and inefficient; you should be able to call model.fit() directly in the distributed trainers, which should be more efficient.
  • An alternative is to use Ray AIR – check out this Keras/TensorFlow training example for more details. Note, however, that your desired per-epoch shuffle on datapoint IDs isn’t quite supported there yet; it should be soon.
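To make the first point concrete, here’s a rough sketch of what the training function could look like without experimental_distribute_dataset. This assumes the Ray AIR TensorflowTrainer API (names like session.get_dataset_shard, prepare_dataset_shard, and to_tf may differ in your Ray version, and the tiny model/dataset are placeholders), so treat it as a starting point rather than a drop-in fix:

```python
import ray
import tensorflow as tf
from ray.air import session, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer, prepare_dataset_shard

def train_func(config: dict):
    # Ray Train sets up TF_CONFIG, so the strategy sees the worker layout.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Placeholder model -- build and compile your own here.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
        model.compile(optimizer="adam", loss="mse")

    # Each worker already receives its own shard from Ray Datasets,
    # so there is no need to reshard or rebatch with
    # tf.distribute.Strategy.experimental_distribute_dataset.
    dataset_shard = session.get_dataset_shard("train")
    tf_dataset = prepare_dataset_shard(
        dataset_shard.to_tf(
            feature_columns="x",
            label_columns="y",
            batch_size=config["batch_size"],
        )
    )
    # fit() consumes the per-worker shard directly.
    model.fit(tf_dataset, epochs=config["epochs"])

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"batch_size": 32, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
    # Placeholder dataset -- substitute your real Ray Dataset.
    datasets={"train": ray.data.from_items(
        [{"x": float(i), "y": 2.0 * float(i)} for i in range(128)]
    )},
)
result = trainer.fit()
```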

If the above doesn’t resolve the issue, could you provide a minimal reproducible example to help us debug further? Thanks!