Massive disk usage when using ray.data

Hey Michał, welcome to Ray!

  • Taking a look at tf.distribute.Strategy.experimental_distribute_dataset, it appears to reshard and rebatch the dataset. Since Ray Datasets already shards and batches the dataset under the hood, this extra step is unnecessary and inefficient; you should be able to call model.fit() directly in the distributed trainers, which should be more efficient.
  • An alternative is to use Ray AIR – check out this Keras/TensorFlow training example for more details. Note, however, that your desired per-epoch shuffle on datapoint IDs isn’t quite supported there yet; it should be soon.
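To make the first point concrete, here’s a rough sketch of what the training function could look like without experimental_distribute_dataset. This assumes the Ray AIR TensorflowTrainer API (names like session.get_dataset_shard, prepare_dataset_shard, and to_tf may differ in your Ray version, and the tiny model/dataset are placeholders), so treat it as a starting point rather than a drop-in fix:

```python
import ray
import tensorflow as tf
from ray.air import session, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer, prepare_dataset_shard

def train_func(config: dict):
    # Ray Train sets up TF_CONFIG, so the strategy sees the worker layout.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Placeholder model -- build and compile your own here.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
        model.compile(optimizer="adam", loss="mse")

    # Each worker already receives its own shard from Ray Datasets,
    # so there is no need to reshard or rebatch with
    # tf.distribute.Strategy.experimental_distribute_dataset.
    dataset_shard = session.get_dataset_shard("train")
    tf_dataset = prepare_dataset_shard(
        dataset_shard.to_tf(
            feature_columns="x",
            label_columns="y",
            batch_size=config["batch_size"],
        )
    )
    # fit() consumes the per-worker shard directly.
    model.fit(tf_dataset, epochs=config["epochs"])

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"batch_size": 32, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
    # Placeholder dataset -- substitute your real Ray Dataset.
    datasets={"train": ray.data.from_items(
        [{"x": float(i), "y": 2.0 * float(i)} for i in range(128)]
    )},
)
result = trainer.fit()
```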

If the above doesn’t resolve the issue, could you provide a minimal reproducible example to help us debug further? Thanks!