Process/Materialize Data In Input Order

Bala · March 28, 2024, 11:04am

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

First of all, greetings to everyone; this is my first question as a newbie. I am using Ray Data to read a CSV file, apply some map and filter functions, and materialize the results. I then use iter_rows() for post-processing. At this stage I need to preserve the order of my input data, but I see that Ray changes the order of the rows every time I re-run the code. Is there any way to tell either read_csv or materialize to maintain the input order? I would appreciate any expert advice on this.

The workaround I am considering is to add an extra column to my input data (a range of integers) and sort the dataset on that column before the materialize step. However, I assume this will increase processing time, since sorting is expensive.

Bala · March 29, 2024, 9:43am

Just an update: although I got my workaround working, I also tried the following setting, suggested in a previous post that I had failed to notice earlier. It worked, and the results came out in the correct order.

ray.data.context.DatasetContext.get_current().execution_options.preserve_order = True
