Incremental data load using Ray Dataset

I am trying to create a fast, distributed in-memory data store using Ray. The source is partitioned by date. I load the entire dataset once, split it into multiple shards using the split() API, and keep the shards ready for access. Since the dataset is quite large, the initial read into a Ray Dataset takes a long time.
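For context, the initial load looks roughly like this; the bucket path and shard count are placeholders for our actual values:

```python
import ray

ray.init()

# Source is partitioned by date, e.g. s3://bucket/table/date=YYYY-MM-DD/
# (the path here is a placeholder for our real location).
ds = ray.data.read_parquet("s3://bucket/table/")

# Split the dataset into shards so each consumer gets its own slice.
shards = ds.split(n=8)  # shard count is illustrative
```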

What I want to achieve next is to add incremental logic to this data load, so that I read only what has changed since the last load and append it.

We pass only the changed paths to the ray.data.read_parquet() API call. But I am not sure what the recommended way is to "append" to the object references that are already present.
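For illustration, here is roughly how we compute the changed-only paths and read them; the root path, dates, and helper function are all hypothetical:

```python
import ray

# Hypothetical helper: given the partition dates already loaded and the
# dates currently present at the source, return only the new paths.
def changed_partition_paths(loaded_dates, source_dates, root="s3://bucket/table"):
    return [f"{root}/date={d}" for d in sorted(source_dates - loaded_dates)]

loaded = {"2023-01-01", "2023-01-02"}                   # from the previous load
available = {"2023-01-01", "2023-01-02", "2023-01-03"}  # e.g. from listing the source
paths = changed_partition_paths(loaded, available)

if paths:
    # Reads only the new partitions, not the whole dataset.
    incremental_ds = ray.data.read_parquet(paths)
```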

@Dmitri, @rliaw do you have any recommendations for handling such a case?

You can use `union` to combine two Datasets, for example `ds1.union(ds2)`.
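A minimal sketch, using from_items as a stand-in for the Parquet reads above:

```python
import ray

# Stand-ins for the existing dataset and the newly read increment.
ds1 = ray.data.from_items([{"date": "2023-01-01", "value": 1}])
ds2 = ray.data.from_items([{"date": "2023-01-02", "value": 2}])

# union() concatenates the two datasets, appending the increment's
# blocks to the existing ones rather than re-reading the old data.
combined = ds1.union(ds2)
print(combined.count())  # 2
```

If you keep sharded copies via split(), you would presumably call split() again on the combined dataset after the union.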