I’m loving ray.data and its distributed, lazy mapping for transformations; it’s great! But it’s early days and I’m having trouble with one step. I think I want almost the reverse of `flat_map`: a transform that takes in multiple rows and outputs a single row, which isn’t what `map_batches` does (when I tried it).
The model we’re using operates on objects representing a batch of data (this is distinct from “a list or array of data objects” here). So I can use ray.data like this for evaluation (mild pseudocode):
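To make the distinction concrete, here is a minimal sketch of what I mean by a "batch object" (all names here are hypothetical stand-ins, not the real model's types): the fields of the individual items get packed into parallel, batch-wide containers, rather than being held as a plain list of item objects.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BatchObject:
    # Hypothetical batch type: fields are batch-wide, one entry per item.
    tokens: List[List[int]]   # token ids for each item
    lengths: List[int]        # per-item lengths, kept alongside

def create_batch_object(items: List[dict]) -> BatchObject:
    """Pack a list of per-item dicts into one batch-wide object."""
    return BatchObject(
        tokens=[it["tokens"] for it in items],
        lengths=[len(it["tokens"]) for it in items],
    )

def unpack_batch_object(batch: BatchObject) -> List[dict]:
    """Invert create_batch_object: back to one dict per item."""
    return [
        {"tokens": t, "length": n}
        for t, n in zip(batch.tokens, batch.lengths)
    ]
```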
```python
data = (
    ray.data.from_items(data_items)
    .map(transform1)
    .map(transform2)
)
# but now I have to step away and manually iterate over the batches
batch_objects = [create_batch_object(batch) for batch in data.iter_batches(batch_size)]
# and call the model with batch objects
results = [model(bobj) for bobj in batch_objects]
# and then unpack the results and flatten them
result_batches = flatten([unpack_batch_object(bobj) for bobj in results])
```
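Stripped of ray.data, the manual flow above is just this (a runnable toy, with trivial stand-ins for `create_batch_object`, `model`, and `unpack_batch_object`, which are all placeholders here):

```python
from itertools import islice

def iter_batches(rows, batch_size):
    """Yield fixed-size lists of rows (the last one may be short)."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch

# Toy stand-ins for the pseudocode's helpers (all hypothetical):
create_batch_object = tuple                        # "batch object" = tuple of rows
model = lambda bobj: tuple(x * 10 for x in bobj)   # one call per batch
unpack_batch_object = list                         # back to a list of rows

data = [1, 2, 3, 4, 5]
batch_objects = [create_batch_object(b) for b in iter_batches(data, 2)]
results = [model(bobj) for bobj in batch_objects]
flat = [row for bobj in results for row in unpack_batch_object(bobj)]
# flat == [10, 20, 30, 40, 50]
```

The annoyance is that everything after the `iter_batches` call is eager, plain-Python iteration, outside the dataset's lazy graph.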
I can’t find a `Dataset` function that aggregates multiple rows into batches, just `iter_batches`, which pulls me out of the lazy flow. I would prefer something like:
```python
data = (
    ray.data.from_items(data_items)
    .map(transform1)
    .map(transform2)
    .batch_map(fn_taking_multiple_rows_and_returning_one_row)
    .map(model)
    .flat_map(unpack_batch)
)
```
Here `.batch_map` “maps over batches”: it takes a sequence of rows and returns one row.
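Over a plain iterable, the semantics I have in mind would look like this (a pure-Python sketch of the desired operator, not a real Ray API; the name `batch_map` is my invention):

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

R = TypeVar("R")
S = TypeVar("S")

def batch_map(fn: Callable[[List[R]], S],
              rows: Iterable[R],
              batch_size: int) -> Iterator[S]:
    """Lazily group rows into batches and map each batch to ONE row --
    the many-to-one inverse of flat_map's one-to-many."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield fn(batch)

# e.g. each batch of 3 rows collapses to a single row:
# list(batch_map(sum, range(6), 3)) == [3, 12]
```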
Maybe this exists and, despite blearily scanning the docs, I just missed it; I thought `map_batches` would do this, but it seems to map over rows, just a batch at a time for better vectorization.
Does this exist?