For example, if we read a CSV or Parquet file into a Ray dataset, would the following two loops produce the same rows in the same order?
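import pandas as pd

# Ray dataset: request pandas batches so each batch can be iterated row by row
for batch in ray_dataset.iter_batches(batch_format="pandas"):
    for _, row in batch.iterrows():
        print(row)

# raw file: iterrows() yields (index, row) pairs
for _, row in pd.read_parquet('example.parquet').iterrows():
    print(row)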
What about a shard of the dataset? Would we get the same order as in the original file?
shard1, _, _ = ds.split_at_indices([2000, 5000])
# same order as original_file[:2000]?
for batch in shard1.iter_batches():
    print(batch)
Ray Datasets is under very active development.
I'd recommend always using the latest version to get the performance and functionality improvements.
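Since behavior can change between versions, it may be worth checking the ordering empirically on the version you have installed. Below is a minimal sketch of such a check. The helper `same_order` is hypothetical (it is not part of Ray or pandas) and uses only the standard library; it assumes both sides have already been converted to comparable row objects such as dicts.

```python
from itertools import islice, zip_longest

_MISSING = object()  # marks the point where one side runs out of rows


def same_order(actual_rows, expected_rows, limit=None):
    """Return True if both iterables yield equal rows in the same order.

    `same_order` is a hypothetical helper written for illustration.
    If `limit` is given, only the first `limit` rows of each side are compared.
    """
    if limit is not None:
        actual_rows = islice(actual_rows, limit)
        expected_rows = islice(expected_rows, limit)
    for actual, expected in zip_longest(actual_rows, expected_rows,
                                        fillvalue=_MISSING):
        # A length mismatch or any differing row means the order differs.
        if actual is _MISSING or expected is _MISSING or actual != expected:
            return False
    return True


# Toy stand-ins for "rows from the file" and "rows from the dataset".
original = [{"id": i, "value": i * 10} for i in range(5)]
print(same_order(iter(original), original))
print(same_order(reversed(original), original))
print(same_order(iter(original), original[:2], limit=2))
```

In practice you would pass the rows read back from the dataset (or from a shard) as `actual_rows` and the rows of the original file as `expected_rows`, using `limit` to compare only a prefix such as the first 2000 rows.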