Padded batching

What is the recommended way to implement padded batching using Ray Datasets? We are using padded_batch() in TensorFlow, but would like to move to Ray Datasets so that we get a common data pipeline for TensorFlow and PyTorch. The to_tf() function does regular batching, but I have not found any option for padded batching. Since Ray Datasets does not aim to implement all data operations, a mix of Ray operations and TensorFlow/PyTorch data pipeline operations seems to be the way to go. For this particular case of padded batching, is it recommended to do the padded batching in TensorFlow/PyTorch?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

Hi @jharaldson, thank you for posting this! Apologies for not getting back to you sooner; it looks like this post was missed.

You are correct that padded batching is currently not supported, but this is a great feature request! I’m surprised we haven’t heard it from other users. Would you mind opening a feature request on our GitHub repo?

One current option that might work is to use Datasets for preprocessing but delegate padded batching to a TensorFlow Dataset (tf.data.Dataset). This can be done as follows:

  1. Pass a None batch size to to_tf(), namely ds.to_tf(batch_size=None), which will create a TensorFlow Dataset consisting of entire-block batches (no Datasets-level slicing).
  2. Call unbatch() on the TF Dataset to get a TF Dataset consisting of a stream of rows.
  3. Call padded_batch() on that TF Dataset.

This may or may not work with the existing ds.to_tf(). If it doesn’t, you can directly use ds.iter_batches(batch_size=None) and construct your own TF Dataset using the from_generator() API.
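
Here is a rough sketch of the from_generator() fallback, in case it helps. I have not run it against Ray itself: iter_block_batches() below is a hypothetical stand-in for ds.iter_batches(batch_size=None), and it assumes each batch is a list of variable-length integer rows. The actual batch format depends on your Ray version and dataset schema, so the inner loop will probably need adjusting. It also assumes a TensorFlow 2.x version in which padded_batch() can infer padded_shapes.

```python
import numpy as np
import tensorflow as tf


def iter_block_batches():
    # Hypothetical stand-in for ds.iter_batches(batch_size=None).
    # Assumption: each batch is a list of variable-length integer rows.
    yield [[1, 2, 3], [4]]
    yield [[5, 6], [7, 8, 9, 10]]


def row_generator():
    # Flatten the block-sized batches into a stream of individual rows.
    for batch in iter_block_batches():
        for row in batch:
            yield np.asarray(row, dtype=np.int64)


# Each element is a 1-D int64 tensor of unknown length.
tf_ds = tf.data.Dataset.from_generator(
    row_generator,
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int64),
)

# Pad every batch of 2 rows to the length of its longest row
# (padding value defaults to 0).
padded = tf_ds.padded_batch(2)

batches = [b.numpy() for b in padded]
for b in batches:
    print(b.shape)
```

Each batch is padded only to the longest row in that batch, so batch shapes can differ from one batch to the next. Pass padded_shapes to padded_batch() if you need a fixed length.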