How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello,
First of all, I want to say that I’m completely new to Ray. I’ve decided to use ray.data for data loading at my company, but I’ve run into massive disk usage (2+ TB), while the whole dataset is at most 500 GB (in memory). The machine has 72 CPU cores, 4 GPUs, and 100 GB of system RAM.
Our pipeline looks something like this (a simplified code sketch follows the list):
- We start with a list of datapoint ids, which I load into a Ray dataset using `ray.data.from_items` and quickly convert into a `DatasetPipeline` using `repeat(1)`.
- That way we can introduce a global shuffle at the id level (low memory footprint) that happens every epoch, using `random_shuffle_each_window`.
- Then we call `repeat(None)` so that we can run an unlimited number of epochs, followed by `rewindow(blocks_per_window=4, preserve_epoch=True)`.
- Then we apply several mappings on the dataset (using `map`; our code is not yet vectorized):
  - id → datapoint (each approx. 7-8 MB)
  - datapoint → `Dict[str, np.ndarray]`
  - augmentations
- At that point we call `iter_epochs()` to create a `ds_iter` and start the training loop: for each epoch we convert `next(ds_iter)` using `to_tf(..., batch_size=8, drop_last=True)` and then turn it into a distributed dataset with `strategy.experimental_distribute_dataset`, as per the TF documentation.
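
In code, the whole thing looks roughly like this. It is a simplified sketch: `load_datapoint`, `to_numpy_dict`, and `augment` are stand-ins for our real (not yet vectorized) mapping functions, the id list is a placeholder, and the feature/label arguments of `to_tf` are omitted here:

```python
import numpy as np
import ray
import tensorflow as tf

ray.init()

# Placeholder id list; the real one comes from our own metadata.
datapoint_ids = list(range(10_000))

def load_datapoint(datapoint_id):
    # Stand-in for the real loader: id -> datapoint (~7-8 MB each).
    return {"id": datapoint_id}

def to_numpy_dict(datapoint):
    # Stand-in for the conversion: datapoint -> Dict[str, np.ndarray].
    return {"image": np.zeros((256, 256, 3), dtype=np.float32)}

def augment(arrays):
    # Stand-in for our augmentations.
    return arrays

pipe = (
    ray.data.from_items(datapoint_ids)
    .repeat(1)                        # Dataset -> DatasetPipeline
    .random_shuffle_each_window()     # global shuffle at the id level
    .repeat(None)                     # unlimited epochs
    .rewindow(blocks_per_window=4, preserve_epoch=True)
    .map(load_datapoint)
    .map(to_numpy_dict)
    .map(augment)
)

ds_iter = pipe.iter_epochs()

strategy = tf.distribute.MirroredStrategy()
for epoch in range(10):               # in reality we loop until convergence
    epoch_pipe = next(ds_iter)
    tf_ds = epoch_pipe.to_tf(         # feature/label arguments omitted here
        batch_size=8,
        drop_last=True,
    )
    dist_ds = strategy.experimental_distribute_dataset(tf_ds)
    # ... training step over dist_ds ...
```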
With each epoch the disk usage keeps growing, even though we don’t need any cached data, since we want to repeat the whole data-loading process (including the shuffle) every epoch. This is currently preventing us from switching to Ray for data loading. I’ve dug around, but was unable to find any way to control the amount of disk space Ray uses.
Do you see any potential problems with my approach? Do you have any recommendations on how to apply ray.data to our data pipeline scenario? Is it normal for Ray to eat up so much disk space?
Best regards
Michał Majczak