Dataset sort on lists/tuples

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.
2. Environment:

  • Ray version: 2.52.1
  • Python version: 3.12/3.13
  • OS: macOS/Linux

3. What happened vs. what you expected:

Here’s a minimal example of a problem I encountered in a more complex project.

I’m generating a series of tuples and want to sort them:

import ray
import ray.data

ray.init()

ds = ray.data.from_items(
    [
        (x,y) for y in range(3) for x in range(3)
    ]
)

ds.sort("item").show()

This minimal example always raises an exception (see the traceback below). I don’t expect this behaviour, as native Python can sort tuples directly.
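For comparison, here is a minimal sketch of the native Python behaviour referred to above. It uses only the built-in sorted() on the same tuples and does not involve Ray; the variable names are illustrative.

```python
# Same tuples as in the Ray example, but kept in a plain Python list.
rows = [(x, y) for y in range(3) for x in range(3)]

# Built-in sorting compares tuples lexicographically:
# first element first, then the second element as a tie-breaker.
sorted_rows = sorted(rows)

print(sorted_rows[:4])  # [(0, 0), (0, 1), (0, 2), (1, 0)]
```

This is the ordering I would expect ds.sort("item") to produce.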

I’m not sure whether this failure is intended.

To sort this dataset, I added one column per tuple element, sorted on those columns, and removed them afterwards. Alternatively, I could generate the data in separate columns, but I actually want to consume a tuple in a later step.

Do you have any other suggestions?

If this behaviour is intended, it would be great to mention this limitation in the ray.data documentation, or to explain the sort logic so the behaviour becomes intuitive.

Traceback (most recent call last):
  File "/Users/achim/Code/ray-sort/./ray-sort-list.py", line 12, in <module>
    ds.sort("item").show()
    ~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/dataset.py", line 3570, in show
    for row in self.take(limit):
               ~~~~~~~~~^^^^^^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/dataset.py", line 3492, in take
    for row in limited_ds.iter_rows():
               ~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/iterator.py", line 246, in _wrapped_iterator
    for batch in batch_iterable:
                 ^^^^^^^^^^^^^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/iterator.py", line 190, in _create_iterator
    ) = self._to_ref_bundle_iterator()
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/_internal/iterator/iterator_impl.py", line 27, in _to_ref_bundle_iterator
    ref_bundles_iterator, stats = self._base_dataset._execute_to_iterator()
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/dataset.py", line 6586, in _execute_to_iterator
    bundle_iter, stats, executor = self._plan.execute_to_iterator()
                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/Caskroom/miniforge/base/envs/ray-only/lib/python3.13/site-packages/ray/data/exceptions.py", line 89, in handle_trace
    raise e.with_traceback(None) from SystemException()
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Ray Data’s sort likely does not support datasets in which each row is a tuple stored in a single column named “item”. The sort operation expects a key column with scalar values, not one containing tuples; comparing such values can be ambiguous and lead to the error you observed. This limitation is not clearly documented, but the error is consistent with Ray’s current sort logic, which is designed for tabular data with named columns rather than a single column of tuples. According to the Ray Dataset.sort documentation, the key parameter should be a column name, and the examples always use scalar columns.
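The ValueError text in the traceback is NumPy’s standard message for truth-testing a multi-element array. The sketch below reproduces that message with NumPy alone. It assumes, without confirmation from Ray’s source, that the tuples end up as array-like values that are compared elementwise somewhere inside the sort; the helper name truth_value_error is hypothetical.

```python
import numpy as np

# Two "rows", each holding what used to be a tuple (assumed representation).
a = np.array([0, 1])
b = np.array([1, 0])

# NumPy compares elementwise and returns an array, not a single bool.
elementwise = a < b  # array([ True, False])


def truth_value_error(mask):
    """Return the error message raised when `mask` is used as a bool."""
    try:
        bool(mask)  # what `if a < b:` or a comparison-based sort needs
    except ValueError as exc:
        return str(exc)
    return None


message = truth_value_error(elementwise)
print(message)

# Plain tuples, by contrast, compare lexicographically to a single bool.
print((0, 1) < (1, 0))  # True
```

If this is indeed the mechanism, any comparison-based ordering of such rows would fail in the same way, regardless of the tuple contents.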

I’m not entirely certain, but your workaround (splitting the tuple into separate columns, sorting, and then recombining if needed) is probably the most reliable approach with current Ray Data. There does not appear to be a built-in way to sort directly on tuple-valued columns. Would you like more detail on how to implement this workaround or on the internals of Ray’s sort logic?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Thanks for rehashing my post perfectly.

My expectations are based on how Python’s built-in sort works with tuples and how numpy.sort works on arrays.

As Ray.IO is proud of being Pythonic, one would expect to be able to sort objects. The notion of scalars vs arrays is introduced by libraries like numpy, pandas or pyarrow to distinguish elementwise operations.

The difference is illustrated here:

>>> import numpy
>>> numpy.array([1,2]) < numpy.array([0])
array([False, False])
>>> [1,2] < [0]
False

You are correct that native Python allows sorting of objects like tuples, but Ray Data’s sort is designed around tabular data with named, scalar columns, following the conventions of pandas and pyarrow. As a result, sorting on columns containing complex objects (like tuples or arrays) is not supported and leads to errors, because the underlying block types (pandas/pyarrow) expect scalar values for sorting operations. This is a limitation of Ray Data’s current design, not a general Python restriction; see the Ray Dataset.sort documentation.

If you need to sort by tuple values, the recommended approach is to expand the tuple into separate columns, sort using those columns, and then (optionally) recombine if needed. Would you like a code example for this workaround or more detail on why Ray Data enforces this restriction?

@RunLLM Is there a solution that works better than this functioning example?

import ray
import ray.data
import ray.data.expressions

ray.init()

ds = ray.data.from_items([(x, y) for y in range(3) for x in range(3)])

ds = ds.with_column("item-0", ray.data.expressions.col("item").list.get(0))
ds = ds.with_column("item-1", ray.data.expressions.col("item").list.get(1))

ds = ds.sort(["item-0", "item-1"])

ds = ds.drop_columns(["item-0", "item-1"])

ds.show()

There is currently no simpler or more “native” solution in Ray Data than your approach: expanding tuple elements into separate columns, sorting, and then dropping the helper columns. Ray Data’s sort requires scalar columns and does not support sorting directly on columns containing tuples or arrays, which is consistent with the examples in the Ray Dataset.sort documentation.

Your workaround is the recommended method for sorting by tuple values in Ray Data. There is no built-in support for sorting on object columns, and no alternative that avoids this column expansion step. Would you like more detail on the technical reasons for this limitation?
