_collate_fn argument removed from ray.data.DataIterator.iter_batches

Hi

I noticed that the `_collate_fn` argument was removed from `ray.data.DataIterator.iter_batches` in the 2.47.0 release.

My understanding was that `ray.train.get_dataset_shard` should be used to iterate through batches of data when there is more than one training worker.

I also noticed that the documentation still refers to the use of `collate_fn` in `iter_batches`.

I’m sure there must be a reason, but now it breaks my workflow.

I am not sure if there is any other way to use `collate_fn` without changing my dataset interface.

Can someone help please?

2.48.0 API reference

2.47.0 API reference

I’m using 2.48.0 and testing it locally.

FYI, I implemented a custom dataset and a training script that uses the following pattern.


```python
train_dataset = ray.data.read_parquet(dataset_info_file)
train_dataset = train_dataset.map(ReadDataset, concurrency=4, num_cpus=4)

###### training function ######

def train_loop_per_worker(config):
    ...
    ...
    def collate_fn(batch):
        ...
        ...
        return train_data, train_target, train_metadata

    train_data_shard = ray.train.get_dataset_shard("train")
    train_dataloader = train_data_shard.iter_batches(
        batch_size=batch_size,
        _collate_fn=collate_fn,
        prefetch_batches=True,
        drop_last=True,
    )

    for epoch in range(start_epoch, num_epochs):
        ...
        ...
        for data, targets, metadata in train_dataloader:
            ...
            ...
```

Thank you.

Hey @mitul93, are you able to use `iter_torch_batches`?

collate_fn was originally intended only for iter_torch_batches, so it’s been removed from the iter_batches API.

I can update the documentation that mentions it for `iter_batches`.
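
In the meantime, one possible workaround that keeps the dataset interface unchanged is to apply the collate function yourself to each batch that `iter_batches` yields. The sketch below is only an illustration, not a Ray API: `CollatedBatches` is a hypothetical helper, the column names inside `collate_fn` are made up because the original function body is elided, and a plain list of dicts stands in for `train_data_shard.iter_batches(...)`.

```python
from typing import Any, Callable, Iterable, Iterator


class CollatedBatches:
    """Hypothetical re-iterable wrapper that applies collate_fn to every batch."""

    def __init__(self, batches: Iterable[Any], collate_fn: Callable[[Any], Any]):
        self._batches = batches
        self._collate_fn = collate_fn

    def __iter__(self) -> Iterator[Any]:
        # Start a fresh pass over the underlying iterable on each call,
        # so the wrapper can be looped over once per epoch.
        for batch in self._batches:
            yield self._collate_fn(batch)


def collate_fn(batch):
    # Assumed column names; replace with the real collate logic.
    return batch["data"], batch["target"], batch["metadata"]


# Stand-in for train_data_shard.iter_batches(batch_size=..., drop_last=True),
# assuming each batch arrives as a dict of columns.
stand_in_batches = [
    {"data": [1, 2], "target": [0, 1], "metadata": ["a", "b"]},
    {"data": [3, 4], "target": [1, 0], "metadata": ["c", "d"]},
]

train_dataloader = CollatedBatches(stand_in_batches, collate_fn)

collated = []
for data, targets, metadata in train_dataloader:
    collated.append((data, targets, metadata))
```

In the real training loop, the stand-in list would be replaced by the iterable returned from `train_data_shard.iter_batches(...)` without the `_collate_fn` argument, assuming that iterable can be iterated again on each epoch.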

@matthewdeng Thank you for a quick response.

I can’t use iter_torch_batches without changing the way my dataset returns data, target and metadata.

Also, I got the following warning message while trying to change the training loop.

/usr/local/lib/python3.10/dist-packages/ray/data/iterator.py:445: RayDeprecationWarning: Passing a function to `iter_torch_batches(collate_fn)` is deprecated in Ray 2.47. Please switch to using a callable class that inherits from `ArrowBatchCollateFn`, `NumpyBatchCollateFn`, or `PandasBatchCollateFn`.

I find it a bit confusing to use `collate_fn` based on the documentation and examples. I think I’ll change the way my dataloader returns data and try to avoid `collate_fn`.

I’ll ask for help in case I’m stuck somewhere :slight_smile:

Thanks.