For example, if we read a CSV or Parquet file into a Ray dataset, would the following two loops yield rows in the same order?
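import pandas as pd
# ray dataset (request pandas batches so each batch is a DataFrame)
for batch in ray_dataset.iter_batches(batch_format="pandas"):
    for _, row in batch.iterrows():
        print(row)
# raw file (iterrows() yields (index, row) pairs)
for _, row in pd.read_parquet('example.parquet').iterrows():
    print(row)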
What about a shard of the dataset? Would it have the same order as the original file?
shard1, _, _ = ds.split_at_indices([2000, 5000])
# same order as the original_file[:2000] ?
for batch in shard1.iter_batches():
    print(batch)
Can I get any help here? No explicit docs could be found for this.
Can you try setting
ray.data.context.DatasetContext.get_current().execution_options.preserve_order = True like this example here?
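A guarded way to try this, as a rough sketch: the helper name enable_preserve_order is made up for illustration, and the fallback to a class named DataContext assumes newer Ray releases where the context class was renamed. It returns False when Ray or the option is not available, so it should be safe to call on any version.

```python
def enable_preserve_order() -> bool:
    """Try to turn on order-preserving execution; return True on success."""
    try:
        import ray.data.context as ctx_mod
    except ImportError:
        # Ray (or its data extras) is not installed.
        return False

    # Assumption: newer releases expose DataContext, older ones DatasetContext.
    ctx_cls = getattr(ctx_mod, "DataContext", None) or getattr(
        ctx_mod, "DatasetContext", None
    )
    if ctx_cls is None:
        return False

    options = getattr(ctx_cls.get_current(), "execution_options", None)
    if options is None or not hasattr(options, "preserve_order"):
        # This version has no such option.
        return False

    options.preserve_order = True
    return True


if __name__ == "__main__":
    print("preserve_order enabled:", enable_preserve_order())
```

If it prints False, the installed version does not expose the option and the setting above will not work as written.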
It seems this option is not supported in ray==2.2.0. How can I get it for ray 2.2.0? Is order not preserved in all versions of Ray?
Ray Dataset is under very active development.
I’d actually recommend you always use the latest version for performance and functionality improvements.
I am using a library that requires ray==2.2.0. Anyway, I will implement it myself without the Ray Dataset API, since Ray Dataset does not work for this.
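For anyone taking the same route, here is a minimal sketch of order-preserving sharding and batching without Ray, using only the standard library on a CSV file. The function names (read_rows, split_at_indices, iter_batches) are hypothetical and only mirror the Ray calls discussed above. It assumes the file fits in memory.

```python
import csv
from itertools import islice
from typing import Iterator, List, Sequence


def read_rows(path: str) -> Iterator[dict]:
    """Yield the rows of a CSV file as dicts, in file order."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)


def split_at_indices(
    rows: Sequence[dict], indices: Sequence[int]
) -> List[List[dict]]:
    """Split rows into len(indices) + 1 contiguous shards, keeping order."""
    bounds = [0, *indices, len(rows)]
    return [list(rows[start:end]) for start, end in zip(bounds, bounds[1:])]


def iter_batches(rows: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive batches of at most batch_size rows."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch
```

Usage would look like rows = list(read_rows('example.csv')) followed by shard1, _, _ = split_at_indices(rows, [2000, 5000]). Since everything is plain slicing, shard1 is exactly rows[:2000]. The same slicing idea should carry over to a DataFrame loaded with pd.read_parquet, using .iloc.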