Incremental data load using Ray Dataset

I am trying to create a fast, distributed in-memory data store using Ray. The source is partitioned by date. I load the entire dataset once, split it into multiple shards using the split() API, and keep the shards ready for access. Since the dataset is quite large, the initial read into a Ray Dataset takes a long time.
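For context, the initial load looks roughly like this; the bucket path and shard count are placeholders for our actual values:

```python
import ray

ray.init()

# Source is partitioned by date, e.g. s3://bucket/table/date=YYYY-MM-DD/
# (the path here is a placeholder for our real location).
ds = ray.data.read_parquet("s3://bucket/table/")

# Split the dataset into shards so each consumer gets its own slice.
shards = ds.split(n=8)  # shard count is illustrative
```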

What I want to achieve next is to add incremental logic to this data load, so that I read only what has changed since the last load and append it.

We pass only the changed paths to the ray.data.read_parquet() API call. But I am not sure what the recommended way is to "append" to the object references that are already present.
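For illustration, here is roughly how we compute the changed-only paths and read them; the root path, dates, and helper function are all hypothetical:

```python
import ray

# Hypothetical helper: given the partition dates already loaded and the
# dates currently present at the source, return only the new paths.
def changed_partition_paths(loaded_dates, source_dates, root="s3://bucket/table"):
    return [f"{root}/date={d}" for d in sorted(source_dates - loaded_dates)]

loaded = {"2023-01-01", "2023-01-02"}                   # from the previous load
available = {"2023-01-01", "2023-01-02", "2023-01-03"}  # e.g. from listing the source
paths = changed_partition_paths(loaded, available)

if paths:
    # Reads only the new partitions, not the whole dataset.
    incremental_ds = ray.data.read_parquet(paths)
```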

@Dmitri, @rliaw do you have any recommendations for handling such a case?

You can use `union` to combine two Datasets, for example `ds1.union(ds2)`.
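A minimal sketch, using from_items as a stand-in for the Parquet reads above:

```python
import ray

# Stand-ins for the existing dataset and the newly read increment.
ds1 = ray.data.from_items([{"date": "2023-01-01", "value": 1}])
ds2 = ray.data.from_items([{"date": "2023-01-02", "value": 2}])

# union() concatenates the two datasets, appending the increment's
# blocks to the existing ones rather than re-reading the old data.
combined = ds1.union(ds2)
print(combined.count())  # 2
```

If you keep sharded copies via split(), you would presumably call split() again on the combined dataset after the union.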