[Data] Pandas throwing error when iterating over batches using `Dataset.iter_batches()`

I’m getting the following error (from Pandas) when iterating over batches using Dataset.iter_batches():
TypeError: cannot do slice indexing on RangeIndex with these indexers [288320.0] of type float
Full traceback at the bottom.

Any help would be appreciated.

Details

This is running on an Anyscale cluster with base image anyscale/ray-ml:2.9.0-py39-cpu

Oddly, I don’t get the error for smaller datasets with the same schema (up to 10^8 rows, ~10 GB total CSV file size). The error shows up on datasets with 5 * 10^8 rows (~50 GB total CSV file size) and larger.

Code

In all relevant cases, `datasets` contains just a single Ray Dataset loaded from either CSV or Parquet files on S3, with `batch_sizes = [1_000_000]` and `flags = [True]`.

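import itertools

# One pandas-format batch iterator per enabled dataset.
batch_iterators = [
    ds.iter_batches(batch_size=batch_size, batch_format="pandas")
    for ds, batch_size, flag in zip(datasets, batch_sizes, flags)
    if flag
]

# Pull one batch from every iterator per step; exhausted iterators yield None.
for i, batch_tuple in enumerate(itertools.zip_longest(*batch_iterators)):
    ...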
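
To illustrate how the loop above consumes its iterators, here is a minimal sketch that needs neither Ray nor S3. `fake_batches` is a hypothetical stand-in for `ds.iter_batches(...)` and is not part of any library. The point is that `itertools.zip_longest` pulls batches lazily inside the `for` statement, so any exception raised while a batch is being produced surfaces at that line, which matches the `my_utils.py` frame in the traceback.

```python
import itertools


def fake_batches(n_rows, batch_size):
    """Hypothetical stand-in for ds.iter_batches: yields lists of row indices."""
    for start in range(0, n_rows, batch_size):
        yield list(range(start, min(start + batch_size, n_rows)))


# Two iterators of different lengths (3 batches and 2 batches).
iterators = [fake_batches(5, 2), fake_batches(3, 2)]

# zip_longest advances every iterator once per step and pads exhausted
# iterators with None, so the loop runs as long as the longest iterator.
steps = list(itertools.zip_longest(*iterators))

for i, batch_tuple in enumerate(steps):
    print(i, batch_tuple)
```

With a single dataset, as in the failing case, each `batch_tuple` is a one-element tuple, so `zip_longest` itself is unlikely to be the source of the float index.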

Logs

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/xyz/api.py", line 97, in run
    *outputs, second, third = foo(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/xyz/p1/p2/p3/python.py", line 186, in foo
    bar(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/xyz/utils/my_utils.py", line 391, in bar
    for i, batch_tuple in enumerate(itertools.zip_longest(*batch_iterators)):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/iterator.py", line 183, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 176, in iter_batches
    next_batch = next(async_batch_iter)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 926, in make_async_gen
    raise next_item
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 903, in execute_computation
    for item in fn(thread_safe_generator):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 167, in _async_iter_batches
    yield from extract_data_from_batch(batch_iter)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/util.py", line 210, in extract_data_from_batch
    for batch in batch_iter:
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 306, in restore_original_order
    for batch in batch_iter:
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 926, in make_async_gen
    raise next_item
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 903, in execute_computation
    for item in fn(thread_safe_generator):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 218, in threadpool_computations_format_collate
    yield from formatted_batch_iter
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/util.py", line 158, in format_batches
    for batch in block_iter:
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 883, in __next__
    return next(self.it)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/util.py", line 121, in blocks_to_batches
    batch = batcher.next_batch()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/batcher.py", line 149, in next_batch
    output.add_block(accessor.slice(0, needed, copy=False))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/pandas_block.py", line 184, in slice
    view = self._table[start:end]
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3777, in __getitem__
    indexer = convert_to_index_sliceable(self, key)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 2494, in convert_to_index_sliceable
    return idx._convert_slice_indexer(key, kind="getitem", is_frame=True)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/numeric.py", line 234, in _convert_slice_indexer
    return super()._convert_slice_indexer(key, kind=kind, is_frame=is_frame)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4294, in _convert_slice_indexer
    self._validate_indexer("slice", key.stop, "getitem")
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 6634, in _validate_indexer
    raise self._invalid_indexer(form, key)
TypeError: cannot do slice indexing on RangeIndex with these indexers [288320.0] of type float
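
The last Ray frame in the traceback is `self._table[start:end]` in `pandas_block.py`, and the message reports a float stop value (`288320.0`). The sketch below reproduces the same class of pandas error on a small DataFrame, independent of Ray. It only demonstrates that pandas rejects a float slice bound on a `RangeIndex`; where the float originates inside Ray is not established by this post. `try_slice` is a hypothetical helper introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # default RangeIndex, as in the traceback


def try_slice(frame, start, end):
    """Return the sliced frame, or the TypeError message if pandas rejects the bounds."""
    try:
        return frame[start:end]
    except TypeError as exc:
        return str(exc)


int_result = try_slice(df, 0, 4)      # integer bounds: plain positional slice
float_result = try_slice(df, 0, 4.0)  # float stop: pandas raises TypeError
coerced = try_slice(df, 0, int(4.0))  # casting the bound to int avoids the error

print(len(int_result))
print(float_result)
print(len(coerced))
```

If a float row count or offset reaches the slice in a similar way, casting it to `int` before slicing would be one possible workaround, though the proper fix belongs wherever the float is introduced.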

Can you try the latest Ray image (2.11 at the time of writing)? We merged a number of pandas-related fixes, so this may already be resolved.