Massive disk usage when using ray.data

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello,

First of all, I want to say that I’m completely new to Ray. I’ve decided to use ray.data for data loading in my company, but I’m seeing massive disk usage (2+ TB), while the whole dataset is at most 500 GB (in memory). The machine has 72 CPU cores, 4 GPUs, and 100 GB of system RAM.

Our pipeline looks something like this (a condensed code sketch follows the list):

  • We start with a list of datapoint ids, which I load into a Ray Dataset using ray.data.from_items and immediately convert into a DatasetPipeline using repeat(1).
  • That way we can introduce a global shuffle at the id level (low memory footprint) that happens every epoch, using random_shuffle_each_window.
  • Then we call repeat(None) so that we can run infinitely many epochs.
  • rewindow(blocks_per_window=4, preserve_epoch=True)
  • Then we apply several mappings to the dataset (using map; our code is not yet vectorized):
    • id → datapoint (each approx. 7-8 MB)
    • datapoint → Dict[str, np.ndarray]
    • augmentations
  • At that point we call iter_epochs() to create a ds_iter and start the training loop.
  • For each epoch we convert next(ds_iter) using to_tf(..., batch_size=8, drop_last=True) and then turn it into a distributed dataset using strategy.experimental_distribute_dataset, as per the TF documentation.
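
In code, the pipeline is roughly the sketch below. The loading/augmentation functions and the toy ids are placeholders for our real ones, and the to_tf column arguments are omitted:

```python
import numpy as np
import ray

# Placeholder stand-ins for our real loading / preprocessing functions.
def load_datapoint(datapoint_id):
    # id -> datapoint (each approx. 7-8 MB in the real pipeline)
    return np.random.rand(1_000).astype(np.float32)

def to_feature_dict(datapoint):
    # datapoint -> Dict[str, np.ndarray]
    return {"features": datapoint}

def augment(example):
    # augmentations (no-op in this sketch)
    return example

datapoint_ids = list(range(1_000))  # toy ids; the real dataset is ~500 GB

pipe = (
    ray.data.from_items(datapoint_ids)
    .repeat(1)                          # Dataset -> DatasetPipeline
    .random_shuffle_each_window()       # global shuffle at the id level, every epoch
    .repeat(None)                       # infinite epochs
    .rewindow(blocks_per_window=4, preserve_epoch=True)
    .map(load_datapoint)
    .map(to_feature_dict)
    .map(augment)
)

ds_iter = pipe.iter_epochs()
for epoch_ds in ds_iter:
    tf_ds = epoch_ds.to_tf(
        # feature/label column arguments omitted here; same as in our real code
        batch_size=8,
        drop_last=True,
    )
    # dist_ds = strategy.experimental_distribute_dataset(tf_ds)
    # ... training loop over dist_ds ...
```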

With each epoch the disk usage keeps growing, even though we do not need any cached data: we want to repeat the whole data loading process (including the shuffle) every epoch. This is currently preventing us from switching to Ray for data loading. I’ve dug around, but was unable to find any way to control the amount of disk space Ray uses.

Do you see any potential problems with my approach? Do you have any recommendations on how to apply ray.data to our data pipeline scenario? Is it normal for Ray to eat up so much disk space?

Best regards
Michał Majczak

Hey Michał, welcome to Ray!

  • Taking a look at tf.distribute.Strategy.experimental_distribute_dataset, it looks like this reshards and rebatches the dataset. Since Ray Datasets already shards and batches the dataset under the hood, this is unnecessary and inefficient; you should be able to call model.fit() directly in the distributed trainers instead.
  • An alternative is to use Ray AIR – check out this Keras/TensorFlow training example for more details; a rough sketch of that pattern follows below. Note that your desired logic for the per-epoch shuffle on datapoint IDs isn’t quite supported there yet, but will be soon.
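
For reference, the AIR pattern looks roughly like the sketch below; the model, column names, and toy dataset are placeholders, and the exact arguments may differ depending on your Ray and TF versions:

```python
import tensorflow as tf

import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer, prepare_dataset_shard


def train_loop_per_worker(config):
    # Ray Train sets up TF_CONFIG, so MultiWorkerMirroredStrategy works out of the box.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Placeholder model; build and compile your real model here.
        model = tf.keras.Sequential(
            [tf.keras.layers.InputLayer(input_shape=(1,)), tf.keras.layers.Dense(1)]
        )
        model.compile(optimizer="adam", loss="mse")

    dataset_shard = session.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        tf_dataset = prepare_dataset_shard(
            dataset_shard.to_tf(
                feature_columns=["x"],  # placeholder column names
                label_column="y",
                output_signature=(
                    tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
                    tf.TensorSpec(shape=(None,), dtype=tf.float32),
                ),
                batch_size=8,
            )
        )
        # No experimental_distribute_dataset here: each worker already gets its own shard.
        history = model.fit(tf_dataset, verbose=0)
        session.report({"loss": history.history["loss"][-1]})


# Toy dataset; in practice, pass your preprocessed Ray Dataset here.
train_ds = ray.data.from_items([{"x": float(i), "y": 2.0 * i} for i in range(64)])

trainer = TensorflowTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": train_ds},
)
result = trainer.fit()
```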

If the above doesn’t resolve the problem, could you also provide a minimal reproducible example to help us debug the issue further? Thanks!