How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
Ray Data’s implementation of sort for Dataset does not seem very efficient. Reading through the code, it appears that data is sorted in both the mappers and the reducers, which I don’t think is necessary. For example, we could either (1) have the mappers only distribute rows across the range boundaries and sort each range later in the reducers, or (2) sort in the mappers and then just heap-merge the sorted runs in the reducers. I believe either approach could significantly improve performance.
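
To make the two alternatives concrete, here is a minimal single-process sketch in plain Python. It is not Ray Data's actual code and does not use any Ray API: the function names `partition_then_sort` and `sort_then_merge`, the list-of-lists `blocks` input, and the pre-chosen `boundaries` are all assumptions made for illustration.

```python
import bisect
import heapq


def partition_then_sort(blocks, boundaries):
    """Option 1 (hypothetical): mappers only route rows; reducers sort."""
    num_reducers = len(boundaries) + 1
    reducer_inputs = [[] for _ in range(num_reducers)]
    # Map phase: no sorting, just assign each value to its range.
    for block in blocks:
        for value in block:
            idx = bisect.bisect_right(boundaries, value)
            reducer_inputs[idx].append(value)
    # Reduce phase: a single sort per range.
    return [sorted(part) for part in reducer_inputs]


def sort_then_merge(blocks, boundaries):
    """Option 2 (hypothetical): mappers sort; reducers heap-merge runs."""
    num_reducers = len(boundaries) + 1
    reducer_inputs = [[] for _ in range(num_reducers)]
    # Map phase: sort each block once, then slice it at the boundaries.
    for block in blocks:
        run = sorted(block)
        cuts = [0] + [bisect.bisect_left(run, b) for b in boundaries] + [len(run)]
        for i in range(num_reducers):
            reducer_inputs[i].append(run[cuts[i]:cuts[i + 1]])
    # Reduce phase: k-way merge of already-sorted runs, no second sort.
    return [list(heapq.merge(*runs)) for runs in reducer_inputs]
```

Both functions return one sorted list per reducer, and concatenating those lists gives the fully sorted output. In each option the data is sorted only once, in either the map or the reduce phase, rather than in both.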