Split operation optimization

pwang1234 · January 31, 2024, 6:07pm

Given a lazily produced Dataset:

ray.data.from_items(parts_list).map_batches(something).map_batches(something_else), and so on, is it possible to split it into separate Datasets or DataIterators along partitions of parts_list, or do I have to create the datasets separately?

I know Dataset.split() splits the Dataset, but it has to materialize it, and I don’t want that, since I’m going to do distributed loading on the various parts.

Dataset.streaming_split() seems to be what I want for splitting without materializing, but the problem is that it tolerates worker failures very poorly, since all of the iterators have to be consumed simultaneously.

The context of my problem: I am trying to split my Dataset into approximately equal, but not necessarily exactly equal, pieces that can be loaded onto a distributed collection of actors for training. I’m putting each of the actors on preemptible cloud instances, so I’d like to be able to recover individual partitions of the data without reloading the whole thing. The entire operation is memory-constrained, so I’d prefer not to keep a copy of the data in the object store; ideally it would live in regular heap memory instead.
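
For concreteness, here is a minimal sketch of the fallback mentioned above (creating the datasets separately), assuming parts_list is an ordinary Python list and that something / something_else are batch functions accepted by map_batches. The helper names split_parts and build_partition_dataset are hypothetical and not part of Ray; only ray.data.from_items and Dataset.map_batches are Ray calls.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_parts(parts_list: Sequence[T], n: int) -> List[List[T]]:
    """Split parts_list into n contiguous chunks whose sizes differ by at most one."""
    if n <= 0:
        raise ValueError("n must be positive")
    base, extra = divmod(len(parts_list), n)
    chunks: List[List[T]] = []
    start = 0
    for i in range(n):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(parts_list[start:start + size]))
        start += size
    return chunks


def build_partition_dataset(parts: List[T], transforms: Sequence[Callable]):
    """Build one lazy Dataset for a single partition (hypothetical helper)."""
    # Imported here so split_parts can be used without Ray installed.
    import ray

    ds = ray.data.from_items(parts)
    for fn in transforms:
        # map_batches only extends the lazy plan; nothing runs until iteration.
        ds = ds.map_batches(fn)
    return ds
```

Each actor i would then build and iterate only its own chunk, for example build_partition_dataset(chunks[i], [something, something_else]).iter_batches(), so a preempted actor's replacement can rebuild chunk i alone. This only addresses recovery granularity; it does not by itself change where Ray Data keeps intermediate blocks.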
