Dataset Range in Arrow

brian · January 18, 2023, 4:14pm

Some time ago a Ray developer suggested that the best way to add an ID column to a dataset is to create a second dataset with ray.data.range and then zip it with the original:

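import ray
# `data` is the existing dataset that needs an ID column
ds2 = ray.data.range(data.count())  # one integer per row of `data`
ds2 = ds2.repartition(data.num_blocks())  # match the block count of `data`
data = data.zip(ds2)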

But when I try that I get an error that the range dataset is in the wrong format:
ValueError: Cannot zip <class 'ray.data._internal.arrow_block.ArrowBlockAccessor'> with block of type <class 'list'>

How do I create the range in Arrow format? Alternatively, how do I convert the list dataset to an Arrow table?

Also, is there a way to specify how many partitions to use when creating the range so that I don’t need to repartition it to match?

Thank you!

jianxiao · January 26, 2023, 5:32pm

Hi @brian, I replied on the Ray Slack, but I am posting here as well for the record.

For format conversion: You can convert the dataset with ds = ds.map_batches(lambda x: x, batch_format="pyarrow"). The function is an identity mapping, but because of batch_format="pyarrow" the output dataset is in Arrow format.

For controlling the number of partitions: You can use, for example, ds = ray.data.range(100, parallelism=10). The parallelism parameter sets the desired number of partitions.
