Given a lazily produced Dataset:

```python
ds = ray.data.from_items(parts_list).map_batches(something).map_batches(something_else)  # and so on
```

is it possible to split this into separate Datasets or DataIterators by partitions of `parts_list`, or do I have to create the datasets separately?
I know `Dataset.split()` splits the Dataset, but it has to materialize it, and I don't want that, since I'm going to do distributed loading of the various parts.
`Dataset.streaming_split` seems to be what I want for splitting without materializing, but the problem is that it's very intolerant of worker failures, since all of the returned iterators have to be consumed simultaneously.
The context of my problem: I'm trying to split my Dataset into approximately (but not necessarily exactly) equal pieces that can be loaded onto a distributed collection of actors for training. Each actor runs on a preemptible cloud instance, so I'd like to be able to recover an individual partition of the data without reloading the whole thing. The entire operation is memory-constrained, so I don't want to maintain a copy of the data in the object store; I'd rather keep it in regular heap memory.
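For reference, here's roughly what the "create the datasets separately" fallback would look like. The `partition` helper and `num_workers` are names I made up; the idea is to chunk `parts_list` up front so each shard becomes its own independent lazy Dataset that can be rebuilt on its own if a worker is preempted:

```python
def partition(items, k):
    """Split items into k contiguous, roughly equal chunks."""
    base, extra = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        chunks.append(items[start:start + size])
        start += size
    return chunks

# One independent lazy Dataset per chunk (ray assumed importable; something /
# something_else are the mapped functions from the pipeline above):
# shards = [
#     ray.data.from_items(chunk).map_batches(something).map_batches(something_else)
#     for chunk in partition(parts_list, num_workers)
# ]
# If the actor holding shards[i] is preempted, only partition(parts_list,
# num_workers)[i] needs to be reprocessed, not the whole Dataset.
```

The obvious downside is losing any cross-partition scheduling Ray Data would otherwise do, which is why I'd prefer a way to split the single pipeline.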