Dataset sort on lists/tuples

achimgaedke · December 20, 2025, 3:52am

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.
2. Environment:

Ray version: 2.52.1
Python version: 3.12/3.13
OS: MacOS/Linux

3. What happened vs. what you expected:

Here’s a minimal example of a problem I encountered in a more complex project.

I’m generating a series of tuples and want to sort them:

import ray
import ray.data

ray.init()

ds = ray.data.from_items(
    [
        (x,y) for y in range(3) for x in range(3)
    ]
)

ds.sort("item").show()

This minimal example results in an exception (see below) - no way around it. I don’t expect this behaviour, as one can sort tuples in native Python.

I’m not sure whether this failure is intended.

To sort this dataset, I added a column for each value of the tuple… and removed those after the sorting. Alternatively, I could generate the data in separate columns, but I actually want to consume a tuple in a later step.

Do you have any other suggestions?

If this behaviour is actually intended: It would be great to add this type of limitation to the ray.data documentation - or explain the sort logic, so the behaviour becomes intuitive.

Traceback (most recent call last):
  File "/Users/achim/Code/ray-sort/./ray-sort-list.py", line 12, in <module>
    ds.sort("item").show()
    ~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/dataset.py", line 3570, in show
    for row in self.take(limit):
               ~~~~~~~~~^^^^^^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/dataset.py", line 3492, in take
    for row in limited_ds.iter_rows():
               ~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/iterator.py", line 246, in _wrapped_iterator
    for batch in batch_iterable:
                 ^^^^^^^^^^^^^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/iterator.py", line 190, in _create_iterator
    ) = self._to_ref_bundle_iterator()
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/_internal/iterator/iterator_impl.py", line 27, in _to_ref_bundle_iterator
    ref_bundles_iterator, stats = self._base_dataset._execute_to_iterator()
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/dataset.py", line 6586, in _execute_to_iterator
    bundle_iter, stats, executor = self._plan.execute_to_iterator()
                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/exceptions.py", line 89, in handle_trace
    raise e.with_traceback(None) from SystemException()
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

RunLLM · December 20, 2025, 3:52am

Ray Data’s sort may not support datasets where each row is a tuple stored in a single column named “item”. The sort operation appears to expect a key column of scalar values; a column of tuples can lead to ambiguous elementwise comparisons, which is consistent with the error you observed. This limitation is not clearly documented. According to the Ray Dataset.sort documentation, the key parameter should be a column name, and the examples there use scalar columns.

I’m not entirely certain, but your workaround (splitting the tuple into separate columns, sorting, and then recombining if needed) is probably the most reliable approach with current Ray Data. There does not appear to be a built-in way to sort directly on tuple-valued columns. Would you like more detail on how to implement this workaround or on the internals of Ray’s sort logic?

Sources:

Ray Dataset.sort documentation

Hint: Mention @RunLLM in the post for followups.

achimgaedke · December 20, 2025, 5:00am

@RunLLM Thanks for rehashing my post perfectly.

My expectations are based on how Python’s built-in sort works with tuples, or how numpy.sort works on arrays.

As Ray.IO is proud of being Pythonic, one would expect to be able to sort objects. The notion of scalars vs arrays is introduced by libraries like numpy, pandas or pyarrow to distinguish elementwise operations.

The difference is illustrated here:

>>> import numpy
>>> numpy.array([1,2]) < numpy.array([0])
array([False, False])
>>> [1,2] < [0]
False
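
To connect this with the ValueError in the traceback, here is a small sketch that reproduces the same message with NumPy alone. It only illustrates the NumPy behaviour; where exactly Ray Data performs such a comparison is not shown here, and the variable names are made up for the example.

```python
import numpy

# Elementwise comparison yields one boolean per element, not a single bool.
elementwise = numpy.array([0, 1]) < numpy.array([1, 0])
print(elementwise)

# Code that needs a single yes/no answer ends up calling bool() on the
# result, which NumPy refuses for arrays with more than one element.
try:
    bool(elementwise)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# Python tuples compare lexicographically and return a single bool.
tuple_result = (0, 1) < (1, 0)
print(tuple_result)
```

The tuple comparison gives one unambiguous answer, while the array comparison has to be reduced with any() or all() before it can be used as a sort criterion.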

RunLLM · December 20, 2025, 5:00am

You are correct that native Python allows sorting of objects like tuples, but Ray Data’s sort is designed around tabular data with named, scalar columns, following the conventions of pandas and pyarrow. As a result, sorting on columns containing complex objects (like tuples or arrays) is not supported and leads to errors, as the underlying block types (pandas/pyarrow) expect scalar values for sorting operations. This is a limitation of Ray Data’s current design, not a general Pythonic restriction—see Ray Dataset.sort documentation.

If you need to sort by tuple values, the recommended approach is to expand the tuple into separate columns, sort using those columns, and then (optionally) recombine if needed. Would you like a code example for this workaround or more detail on why Ray Data enforces this restriction?
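
As a rough illustration of the expand, sort, recombine idea, here is a NumPy-only sketch. It is not Ray Data code: numpy.lexsort stands in for a multi-column sort, and the names col0 and col1 are hypothetical. It shows that sorting on the two scalar columns gives the same order as Python's built-in tuple sort.

```python
import numpy

# Same data as in the original post: tuples in a deliberately unsorted order.
items = [(x, y) for y in range(3) for x in range(3)]

# Expand: one scalar column per tuple element.
table = numpy.array(items)
col0 = table[:, 0]
col1 = table[:, 1]

# Sort: lexsort treats the LAST key as the primary one,
# so this orders by col0 first, then by col1.
order = numpy.lexsort((col1, col0))

# Recombine: rebuild plain Python tuples in sorted order.
sorted_items = [(int(a), int(b)) for a, b in table[order]]

print(sorted_items == sorted(items))
```

The helper columns exist only for the duration of the sort, which mirrors adding and then dropping columns in a dataset.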

Sources:

Ray Dataset.sort documentation


achimgaedke · December 21, 2025, 7:28am

@RunLLM Is there a solution that works better than this functioning example?

import ray
import ray.data
import ray.data.expressions

ray.init()

ds = ray.data.from_items([(x, y) for y in range(3) for x in range(3)])

ds = ds.with_column("item-0", ray.data.expressions.col("item").list.get(0))
ds = ds.with_column("item-1", ray.data.expressions.col("item").list.get(1))

ds = ds.sort(["item-0", "item-1"])

ds = ds.drop_columns(["item-0", "item-1"])

ds.show()

RunLLM · December 21, 2025, 7:28am

There is currently no simpler or more “native” solution in Ray Data than your approach: expanding tuple elements into separate columns, sorting, and then dropping the helper columns. Ray Data’s sort requires scalar columns, and does not support sorting directly on columns containing tuples or arrays, as confirmed by the Ray Dataset.sort documentation.

Your workaround is the recommended method for sorting by tuple values in Ray Data. There is no built-in support for sorting on object columns, and no alternative that avoids this column expansion step. Would you like more detail on the technical reasons for this limitation?

Sources:

Ray Dataset.sort documentation

