Padded batching

jharaldson · March 22, 2022, 9:14am

What is the recommended way to implement padded batching using Ray Datasets? We are using padded_batch() in TensorFlow, but would like to move to Ray Datasets so that we get a common data pipeline for TensorFlow and PyTorch. The to_tf() function does regular batching, but I have not found any option for padded batching. Since Ray Datasets does not aim to implement every data operation, mixing Ray operations with TensorFlow/PyTorch data pipeline operations seems to be the way to go. For padded batching in particular, is it recommended to do it in TensorFlow/PyTorch?

How severely does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty in completing my task, but I can work around it.

Clark_Zinzow · April 22, 2022, 5:41pm

Hi @jharaldson, thank you for posting this! Apologies for not getting back to you sooner; it looks like this post was missed.

You are correct that padded batching is currently not supported, but this is a great feature request! I’m surprised we haven’t heard it from other users. Would you mind opening a feature request on our GitHub repo?

One current option that might work is to use Datasets for preprocessing but delegate to TensorFlow Datasets for padded batching. This can be done as follows:

1. Pass a None batch size to to_tf(), namely ds.to_tf(batch_size=None), which will create a TensorFlow Dataset consisting of entire-block batches (no Datasets-level slicing).
2. Call unbatch() on the TF Dataset to get a TF Dataset consisting of a stream of rows.
3. Call padded_batch() on that TF Dataset.

This may or may not work with the existing ds.to_tf(). If it doesn’t, you can directly use ds.iter_batches(batch_size=None) and construct your own TF Dataset using the from_generator() API.
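
For reference, here is a minimal, framework-neutral sketch of what padded batching does to a stream of variable-length rows. It assumes each row is a plain list of integers; `padded_batches` and `_pad` are hypothetical helpers written for illustration, not Ray or TensorFlow APIs, and the hard-coded `rows` list is only a stand-in for rows pulled from a Dataset. A generator along these lines is the kind of thing that could be handed to from_generator().

```python
from typing import Iterable, Iterator, List, Sequence


def padded_batches(
    rows: Iterable[Sequence[int]],
    batch_size: int,
    pad_value: int = 0,
) -> Iterator[List[List[int]]]:
    """Group variable-length rows into batches, padding each batch
    to the length of its longest row (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Sequence[int]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield _pad(batch, pad_value)
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield _pad(batch, pad_value)


def _pad(batch: List[Sequence[int]], pad_value: int) -> List[List[int]]:
    """Right-pad every row in the batch to the batch's maximum length."""
    width = max(len(row) for row in batch)
    return [list(row) + [pad_value] * (width - len(row)) for row in batch]


# Stand-in for a stream of rows (assumption: in practice these would
# come from iterating over the Ray Dataset).
rows = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10], [11]]
batches = list(padded_batches(rows, batch_size=2))
```

Note that padding is per batch, so different batches can have different widths. This is also how padded_batch() behaves when the padded dimension is left unspecified.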

jharaldson · October 26, 2022, 12:36pm

@Clark_Zinzow, now that I have noticed you added support for ragged tensors, I think padded batching makes even more sense. I have added a feature request.
