Implementation of sort is not optimal

z4y1b2 · September 14, 2023, 3:40am

How severely does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

Ray Data’s implementation of sort for Dataset is not very efficient. Reading through the code, it appears that data is sorted in both the mappers and the reducers, which I don’t think is necessary. For example, we could just partition the data by range boundaries in the mappers and sort each partition later in the reducers, or sort in the mappers and then just heap-merge the sorted runs in the reducers. I believe either approach could significantly improve performance.
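To make the two alternatives concrete, here is a minimal single-process sketch in plain Python. It is not Ray Data's code: `partition_then_sort`, `sort_then_merge`, and the `blocks`/`boundaries` arguments are hypothetical names for illustration, blocks are assumed to be plain lists of comparable values, and `boundaries` is assumed to be sorted.

```python
import bisect
import heapq


def partition_then_sort(blocks, boundaries):
    """Option 1: mappers only bucket rows by range; reducers do the sorting."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for block in blocks:  # "map" side: no sorting, just range assignment
        for row in block:
            buckets[bisect.bisect_right(boundaries, row)].append(row)
    # "reduce" side: each bucket is sorted exactly once
    return [sorted(bucket) for bucket in buckets]


def sort_then_merge(blocks, boundaries):
    """Option 2: mappers sort and slice; reducers heap-merge the sorted runs."""
    runs = [[] for _ in range(len(boundaries) + 1)]
    for block in blocks:
        ordered = sorted(block)  # "map" side: one sort per block
        # Slice the sorted block at each boundary (rows equal to a boundary
        # go to the higher range, matching partition_then_sort).
        cuts = [0] + [bisect.bisect_left(ordered, b) for b in boundaries]
        cuts.append(len(ordered))
        for i in range(len(runs)):
            runs[i].append(ordered[cuts[i]:cuts[i + 1]])
    # "reduce" side: k-way merge of already-sorted runs, no full re-sort
    return [list(heapq.merge(*parts)) for parts in runs]


blocks = [[42, 7, 66, 13], [99, 33, 5, 70], [66, 1, 50, 34]]
boundaries = [33, 66]
assert partition_then_sort(blocks, boundaries) == sort_then_merge(blocks, boundaries)
```

In both versions each row is sorted only once, on either the map side or the reduce side, and concatenating the output partitions in order gives the fully sorted data.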

sjl · September 20, 2023, 11:28pm

Hi @z4y1b2, thanks for the suggestions. We are always looking for performance improvements and optimizations. It would be great if you could open a Ray feature request on GitHub to track the discussion in more detail, and even submit a PR with the improvements!
