Error after shuffling Ray dataset when splitting into train and test

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

The Problem:

I have a large Ray Dataset and want to split it like this:

train, test = ray_ds.train_test_split(test_size=0.05, shuffle=True)

for batch in train.iter_torch_batches(batch_size=1, device="cuda:0"):
    # training
    ...

Unfortunately, this only works with shuffle=False; otherwise I get the following error:

Traceback (most recent call last):
  File "/home/USER/PycharmProjects/Projectname_Preprocessing/CanonicalDataset.py", line 395, in <module>
    for batch in train.iter_torch_batches(batch_size=1, device="cuda:0", ):
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/data/dataset.py", line 2523, in iter_torch_batches
    for batch in self.iter_batches(
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/data/dataset.py", line 2450, in iter_batches
    yield from batch_blocks(
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/data/_internal/block_batching.py", line 129, in batch_blocks
    yield from get_batches(block_window[0])
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/data/_internal/block_batching.py", line 99, in get_batches
    result = _format_batch(batch, batch_format)
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/data/_internal/block_batching.py", line 147, in _format_batch
    batch = BlockAccessor.for_block(batch).to_numpy()
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/data/_internal/arrow_block.py", line 217, in to_numpy
    arrays.append(array.to_numpy(zero_copy_only=False))
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/air/util/tensor_extensions/arrow.py", line 285, in to_numpy
    return self._to_numpy(zero_copy_only=zero_copy_only)
  File "/home/USER/anaconda3/envs/Projectname_Preprocessing/lib/python3.9/site-packages/ray/air/util/tensor_extensions/arrow.py", line 269, in _to_numpy
    return np.ndarray(shape, dtype=ext_dtype, buffer=data_buffer, offset=offset)
TypeError: buffer is too small for requested array
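
For now I'm falling back to splitting without shuffling, which runs fine. A rough sketch of that, plus an alternative I haven't verified (random_shuffle() is the regular Ray Dataset method, but I don't know whether it goes through the same code path that fails with shuffle=True):

# Works for me: split without shuffling.
train, test = ray_ds.train_test_split(test_size=0.05, shuffle=False)

# Untested idea: shuffle the whole dataset explicitly first, then split
# without shuffling. This may or may not hit the same failure.
# train, test = ray_ds.random_shuffle().train_test_split(test_size=0.05, shuffle=False)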

I'd be grateful for any ideas on how to fix it.

Hi @Alpe6825, what version of Ray are you using? I think we fixed a similar issue in a recent release (or in latest master).
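
In case it's useful, one quick way to check the installed version:

import ray
print(ray.__version__)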

Hi @Clark_Zinzow, thank you for your reply. I'm using Ray 2.1.0, installed via pip.