I am trying to create a fast, distributed in-memory data store using Ray. The source is partitioned by date. I load the entire dataset once into multiple shards using the split() API and keep it ready for access. Since the dataset is quite large, it takes a long time to read into a Ray Dataset.
What I want to achieve next is to add incremental logic to this data load, where I read only what has changed since the last load and append it.
We pass only the changed paths to the ray.data.read_parquet() API call. But I am not sure what the recommended way is to "append" to the object references that are already present?
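Roughly, the delta-detection side of what I have in mind looks like the sketch below. The bucket layout, path names, and the `changed_partitions` helper are all hypothetical, just to illustrate the incremental step; only the delta would then go to ray.data.read_parquet().

```python
def changed_partitions(all_paths, loaded_paths):
    """Return only the date-partitioned paths that have not been loaded yet.

    all_paths: every partition path currently visible at the source.
    loaded_paths: partition paths already read into the Ray Dataset.
    """
    loaded = set(loaded_paths)
    return [p for p in all_paths if p not in loaded]


# Hypothetical date-partitioned layout (illustrative only).
all_paths = [f"s3://my-bucket/table/dt=2023-01-0{d}/" for d in range(1, 5)]
already_loaded = all_paths[:2]  # loaded in the initial bulk read

delta = changed_partitions(all_paths, already_loaded)
# delta would be passed to ray.data.read_parquet(delta);
# the open question is how to combine the resulting dataset
# with the shards from the initial load.
print(delta)
```

The part I am unsure about is the last step: once read_parquet() returns a new Dataset for the delta, what is the right way to merge it with the existing sharded datasets without re-reading everything?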
@Dmitri, @rliaw do you have any recommendations for handling such a case?