I’m getting the following error (from Pandas) when iterating over batches using Dataset.iter_batches():
TypeError: cannot do slice indexing on RangeIndex with these indexers [288320.0] of type float
Full traceback at the bottom.
Any help would be appreciated.
Details
This is running on an Anyscale cluster with base image anyscale/ray-ml:2.9.0-py39-cpu
Weirdly, I don’t get the error for smaller datasets (up to 10^8 rows, ~10GB total CSV file size) with the same schema. The error shows up on datasets with 5 * 10^8 rows (~50GB total CSV file size) and larger.
Code
In all relevant cases, datasets contains just a single Ray Dataset loaded from either CSV or Parquet files on S3, batch_sizes = [1_000_000], and flags = [True].
batch_iterators = [
    ds.iter_batches(batch_size=batch_size, batch_format="pandas")
    for ds, batch_size, flag in zip(datasets, batch_sizes, flags)
    if flag
]
for i, batch_tuple in enumerate(itertools.zip_longest(*batch_iterators)):
    ...
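For reference, the same pandas error can be triggered without Ray. This is only a minimal sketch, not my actual code: the toy DataFrame and the slice_rows helper are made up for illustration, and the slice simply mirrors the shape of the self._table[start:end] call that appears in the traceback. It suggests that the end bound reaching pandas is a float (288320.0) rather than an int, though it says nothing about where that float comes from.

```python
import pandas as pd

# Toy frame with the default RangeIndex, standing in for a pandas block.
df = pd.DataFrame({"x": range(10)})


def slice_rows(frame, start, end):
    """Hypothetical helper: a positional row slice shaped like `self._table[start:end]`."""
    return frame[start:end]


# Integer bounds slice the frame as expected.
ok = slice_rows(df, 0, 5)

# A float bound, even a whole-valued one, is rejected by pandas.
try:
    slice_rows(df, 0, 5.0)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(len(ok))
print(error_message)
```

Casting the bound with int() before slicing avoids the error in this toy case, but I have not tracked down where a float row count would originate in the real pipeline.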
Logs
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.9/site-packages/xyz/api.py", line 97, in run
*outputs, second, third = foo(
File "/home/ray/anaconda3/lib/python3.9/site-packages/xyz/p1/p2/p3/python.py", line 186, in foo
bar(
File "/home/ray/anaconda3/lib/python3.9/site-packages/xyz/utils/my_utils.py", line 391, in bar
for i, batch_tuple in enumerate(itertools.zip_longest(*batch_iterators)):
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/iterator.py", line 183, in _create_iterator
for batch in iterator:
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 176, in iter_batches
next_batch = next(async_batch_iter)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 926, in make_async_gen
raise next_item
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 903, in execute_computation
for item in fn(thread_safe_generator):
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 167, in _async_iter_batches
yield from extract_data_from_batch(batch_iter)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/util.py", line 210, in extract_data_from_batch
for batch in batch_iter:
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 306, in restore_original_order
for batch in batch_iter:
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 926, in make_async_gen
raise next_item
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 903, in execute_computation
for item in fn(thread_safe_generator):
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 218, in threadpool_computations_format_collate
yield from formatted_batch_iter
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/util.py", line 158, in format_batches
for batch in block_iter:
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 883, in __next__
return next(self.it)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/util.py", line 121, in blocks_to_batches
batch = batcher.next_batch()
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/batcher.py", line 149, in next_batch
output.add_block(accessor.slice(0, needed, copy=False))
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/pandas_block.py", line 184, in slice
view = self._table[start:end]
File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 3777, in __getitem__
indexer = convert_to_index_sliceable(self, key)
File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 2494, in convert_to_index_sliceable
return idx._convert_slice_indexer(key, kind="getitem", is_frame=True)
File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/numeric.py", line 234, in _convert_slice_indexer
return super()._convert_slice_indexer(key, kind=kind, is_frame=is_frame)
File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4294, in _convert_slice_indexer
self._validate_indexer("slice", key.stop, "getitem")
File "/home/ray/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 6634, in _validate_indexer
raise self._invalid_indexer(form, key)
TypeError: cannot do slice indexing on RangeIndex with these indexers [288320.0] of type float