Dataset Range in Arrow

Some time ago a Ray developer suggested that the best way to add an ID column to a dataset was to create a range dataset and then zip it with the original dataset:

ds2 =
ds2 = ds2.repartition(data.num_blocks())
data =

But when I try that I get an error that the range dataset is in the wrong format:
ValueError: Cannot zip <class ''> with block of type <class 'list'>

How do I create the range in Arrow format? Or alternatively, how do I convert the list dataset to an Arrow table?

Also, is there a way to specify how many partitions to use when creating the range so that I don’t need to repartition it to match?

Thank you!

Hi @brian, I replied on the Ray Slack, but I'm posting here as well for the record.

For format conversion: You can convert it with ds = ds.map_batches(lambda x: x, batch_format="pyarrow"). The function is an identity mapping, but because of batch_format="pyarrow" the resulting dataset is stored in Arrow format.

For controlling the number of partitions: You can just use e.g. ds =, parallelism=10). The parallelism parameter sets the desired number of partitions.