How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello,
First of all, I want to say that I’m completely new to Ray. I’ve decided to use ray.data for data loading at my company, but I’ve run into massive disk usage (2+ TB), while the whole dataset is at most 500 GB (in memory). The machine has 72 CPU cores, 4 GPUs, and 100 GB of system RAM.
Our pipeline looks something like this (a simplified code sketch follows the list):
- We start with a list of datapoint ids, which I load into a Ray dataset using `ray.data.from_items` and quickly convert into a `DatasetPipeline` using `repeat(1)`.
- That way we can introduce a global shuffle at the id level (low memory footprint) that happens every epoch, using `random_shuffle_each_window`.
- Then we call `repeat(None)` so that we can run an unlimited number of epochs, followed by `rewindow(blocks_per_window=4, preserve_epoch=True)`.
- Then we apply several mappings on the dataset (using `map`; our code is not yet vectorized):
  - id → datapoint (each approx. 7-8 MB)
  - datapoint → `Dict[str, np.ndarray]`
  - augmentations
- At that point we call `iter_epochs()` to create a `ds_iter` and start the training loop: for each epoch we convert `next(ds_iter)` using `to_tf(..., batch_size=8, drop_last=True)` and then turn it into a distributed dataset with `strategy.experimental_distribute_dataset`, as per the TF documentation.
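
In code, the whole thing looks roughly like this. It is a simplified sketch: `load_datapoint`, `to_numpy_dict`, and `augment` are stand-ins for our real (not yet vectorized) mapping functions, the id list is a placeholder, and the feature/label arguments of `to_tf` are omitted here:

```python
import numpy as np
import ray
import tensorflow as tf

ray.init()

# Placeholder id list; the real one comes from our own metadata.
datapoint_ids = list(range(10_000))

def load_datapoint(datapoint_id):
    # Stand-in for the real loader: id -> datapoint (~7-8 MB each).
    return {"id": datapoint_id}

def to_numpy_dict(datapoint):
    # Stand-in for the conversion: datapoint -> Dict[str, np.ndarray].
    return {"image": np.zeros((256, 256, 3), dtype=np.float32)}

def augment(arrays):
    # Stand-in for our augmentations.
    return arrays

pipe = (
    ray.data.from_items(datapoint_ids)
    .repeat(1)                        # Dataset -> DatasetPipeline
    .random_shuffle_each_window()     # global shuffle at the id level
    .repeat(None)                     # unlimited epochs
    .rewindow(blocks_per_window=4, preserve_epoch=True)
    .map(load_datapoint)
    .map(to_numpy_dict)
    .map(augment)
)

ds_iter = pipe.iter_epochs()

strategy = tf.distribute.MirroredStrategy()
for epoch in range(10):               # in reality we loop until convergence
    epoch_pipe = next(ds_iter)
    tf_ds = epoch_pipe.to_tf(         # feature/label arguments omitted here
        batch_size=8,
        drop_last=True,
    )
    dist_ds = strategy.experimental_distribute_dataset(tf_ds)
    # ... training step over dist_ds ...
```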
With each epoch the disk usage keeps growing, even though we don’t need any cached data, since we want to repeat the whole data-loading process (including the shuffle) every epoch. This is currently preventing us from switching to Ray for data loading. I’ve dug around, but was unable to find any way to control the amount of disk space Ray uses.
Do you see any potential problems with my approach? Do you have any recommendations on how to apply ray.data to our data pipeline scenario? Is it normal for Ray to eat up so much disk space?
Best regards
Michał Majczak