Interleaving file reads with custom datasource

jpthalman · January 23, 2024, 11:17pm

I have a dataset of ~100k files with a custom format and each file contains a variable number of rows. I’ve implemented a custom datasource where each read task gets a list of N files, opens each file sequentially, and yields a DataFrame with a single row for each example in the file. I load this datasource using dataset.read_datasource and then I iterate over it using iter_batches. The problem I’m seeing is that every file contains highly correlated data so it’s important that each batch element comes from a unique file, but I haven’t found a way to achieve this within ray. My current solution is to post process the output batches from ray and cache the rows until I have enough examples from unique files to yield a batch with rows from unique files, but I’m wondering if there’s a more memory efficient way to do this. I’m basically looking for the equivalent of tf.data.Dataset.interleave.

Topic		Replies	Views
Ray Data and Train connection options	0	93	April 12, 2024
Ray dataset from IterableDataset. No lazy implementation?	0	53	November 15, 2024
How to convert Pytorch torch.utils.data.Dataset to ray.data.dataset?	15	1380	December 8, 2022
Best practices around handling giant datasets with ray data (large amount of read tasks)	5	142	October 15, 2024
Ray Data: How to yield entire groups from a batch?	5	269	January 27, 2024

Interleaving file reads with custom datasource

Related topics