Some time ago, a Ray developer suggested that the best way to add an ID column to a dataset is to create a second dataset with ray.data.range
and then zip it with the original dataset:
ds2 = ray.data.range(data.count())
ds2 = ds2.repartition(data.num_blocks())
data = data.zip(ds2)
But when I try that, I get an error saying the range dataset is in the wrong format:
ValueError: Cannot zip <class 'ray.data._internal.arrow_block.ArrowBlockAccessor'> with block of type <class 'list'>
How do I create the range in Arrow format? Or alternatively, how do I convert the list-based dataset to an Arrow table?
Also, is there a way to specify how many partitions to use when creating the range, so that I don’t need to repartition it to match the original?
Thank you!