Dataset Range in Arrow

Some time ago a Ray developer suggested that the best way to add an ID column to a dataset is to create a second dataset with ray.data.range and then zip it with the original dataset:

ds2 = ray.data.range(data.count())
ds2 = ds2.repartition(data.num_blocks())
data = data.zip(ds2)

But when I try that I get an error that the range dataset is in the wrong format:
ValueError: Cannot zip <class 'ray.data._internal.arrow_block.ArrowBlockAccessor'> with block of type <class 'list'>

How do I create the range in Arrow format? Or alternatively, how do I convert the list dataset to an Arrow table?

Also, is there a way to specify how many partitions to use when creating the range so that I don’t need to repartition it to match?

Thank you!

Hi @brian, I replied on the Ray Slack, but I'm posting here as well for the record.

For format conversion: you can convert it with ds = ds.map_batches(lambda x: x, batch_format="pyarrow"). The lambda is an identity mapping, so the rows are unchanged, but because of batch_format="pyarrow" the output dataset is stored in Arrow format.

For controlling the number of partitions: you can use, e.g., ds = ray.data.range(100, parallelism=10). The parallelism parameter sets the desired number of partitions.
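
For intuition about why the range is repartitioned to match the original dataset before zipping, here is a library-free sketch of the underlying idea: each block gets a contiguous slice of the ID range, sized to that block's row count. It deliberately does not use Ray. The helper names block_id_ranges and add_ids are hypothetical, blocks are modeled as plain lists of dicts, and this is an illustration of the concept, not how Ray implements zip internally.

```python
from itertools import accumulate


def block_id_ranges(block_sizes):
    """Return one range of row IDs per block, contiguous across blocks.

    Hypothetical helper for illustration only (not part of Ray).
    """
    # The offset of each block is the number of rows in all earlier blocks.
    offsets = [0, *accumulate(block_sizes)][:-1]
    return [range(start, start + size) for start, size in zip(offsets, block_sizes)]


def add_ids(blocks):
    """Attach an 'id' field to every row; blocks is a list of lists of dicts."""
    ranges = block_id_ranges([len(block) for block in blocks])
    return [
        [{**row, "id": row_id} for row, row_id in zip(block, ids)]
        for block, ids in zip(blocks, ranges)
    ]


# Three blocks with 2, 1, and 2 rows respectively.
blocks = [[{"x": "a"}, {"x": "b"}], [{"x": "c"}], [{"x": "d"}, {"x": "e"}]]
with_ids = add_ids(blocks)
```

In this sketch the pairing happens block by block, so the ID side has to have the same number of blocks, with the same number of rows in each, as the data side.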