Implementation of sort is not optimal

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Ray Data’s implementation of sort for Dataset is not very efficient. Reading through the code, it appears that data is sorted in both the mappers and the reducers, which I don’t think is necessary. For example, we could just partition the data by the range boundaries in the mappers and sort each partition later in the reducers. Alternatively, we could sort in the mappers and then just heap-merge the sorted runs in the reducers. I believe either approach could significantly improve performance.
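To make the two alternatives concrete, here is a minimal single-process sketch in plain Python. It is not Ray Data's code: the function names, the in-memory lists standing in for blocks, and the fixed `boundaries` list are all assumptions made for illustration.

```python
import bisect
import heapq


def partition_unsorted(block, boundaries):
    """Map side, option A: route items to range partitions without sorting."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for item in block:
        # Items equal to a boundary go to the partition on its right.
        parts[bisect.bisect_right(boundaries, item)].append(item)
    return parts


def sort_in_reducers(blocks, boundaries):
    """Option A: mappers only partition; each reducer sorts its partition."""
    mapped = [partition_unsorted(b, boundaries) for b in blocks]
    output = []
    for i in range(len(boundaries) + 1):
        partition = [x for parts in mapped for x in parts[i]]
        partition.sort()  # the only sort in this pipeline
        output.append(partition)
    return output


def partition_sorted(block, boundaries):
    """Map side, option B: sort once, then cut at boundaries by binary search."""
    run = sorted(block)
    # bisect_left keeps the boundary convention consistent with option A.
    cuts = [0] + [bisect.bisect_left(run, b) for b in boundaries] + [len(run)]
    return [run[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]


def merge_in_reducers(blocks, boundaries):
    """Option B: mappers sort; each reducer heap-merges the sorted runs."""
    mapped = [partition_sorted(b, boundaries) for b in blocks]
    return [
        list(heapq.merge(*(parts[i] for parts in mapped)))
        for i in range(len(boundaries) + 1)
    ]


blocks = [[5, 1, 9, 3], [8, 2, 7], [4, 6, 0]]  # hypothetical input blocks
boundaries = [3, 6]                            # hypothetical range boundaries
result_a = sort_in_reducers(blocks, boundaries)
result_b = merge_in_reducers(blocks, boundaries)
```

In both variants each item is sorted only once, on either the map side or the reduce side, rather than on both. Whether that translates into a real speedup in a distributed setting would need benchmarking.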

Hi @z4y1b2, thanks for the suggestions. We are always looking for performance improvements and optimizations. It would be great if you could open a Ray feature request on GitHub to track the discussion in more detail, and even submit a PR with the improvements!