Bucketing in Ray Dataset?

We would like to implement bucketing in Ray Dataset.

We sort images into buckets based on their aspect ratio (how tall or wide they are), and then emit only batches in which all images belong to the same aspect-ratio bucket. We concluded that this is impossible to implement with Ray Data’s transformation APIs, such as map(), map_batches(), and filter().
I would like your help confirming that it is indeed impossible.
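
To make the requirement concrete, here is a minimal sketch of the batching logic in plain Python, with no Ray dependency. The bucket boundaries in BUCKET_EDGES and the names bucket_id and bucketed_batches are hypothetical, chosen only for illustration. Note that the batcher keeps per-bucket state across rows, which is the part that is hard to express as a stateless per-row or per-batch transformation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical bucket boundaries on width / height (an assumption for this sketch).
BUCKET_EDGES = (0.75, 1.33)

Size = Tuple[int, int]  # (width, height)


def bucket_id(width: int, height: int) -> int:
    """Return the index of the aspect-ratio bucket for an image size."""
    ratio = width / height
    for i, edge in enumerate(BUCKET_EDGES):
        if ratio < edge:
            return i
    return len(BUCKET_EDGES)


def bucketed_batches(sizes: Iterable[Size], batch_size: int) -> Iterator[List[Size]]:
    """Yield batches whose items all fall into the same aspect-ratio bucket."""
    pending: Dict[int, List[Size]] = defaultdict(list)
    for size in sizes:
        b = bucket_id(*size)
        pending[b].append(size)
        # Emit a batch as soon as one bucket has collected enough items.
        if len(pending[b]) == batch_size:
            yield pending.pop(b)
    # Flush the partially filled buckets at the end of the stream.
    for leftover in pending.values():
        if leftover:
            yield leftover
```

For example, list(bucketed_batches([(100, 200), (200, 100), (120, 240)], batch_size=2)) yields one full batch of two portrait images, followed by a leftover batch containing the single landscape image.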

So the one way we see to solve this is to implement a custom Datasource, which would be roughly analogous to ray.data.from_torch().
Is there a better approach? Did I make any incorrect assumptions above?

Unfortunately, this isn’t natively supported very well. There is a groupby() method, and you can then use map_groups() to transform the resulting groups. There are some limitations to this approach, though. From the docs:

While map_groups() is very flexible, note that it comes with downsides:

  • It may be slower than using more specific methods such as min(), max().
  • It requires that each group fits in memory on a single node.

The second point sounds like it might be an issue for you, depending on how big your buckets will be. If that is the case, implementing a custom Datasource might be the best path forward for now. Either way, please feel free to submit an issue on the Ray GitHub so we can track this feature request!
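
As a rough sketch of the groupby() / map_groups() route mentioned above: the column names, the bucket thresholds, and the helper names add_bucket, tag_group, and run_with_ray are all assumptions made for illustration, not part of Ray. It tags each row with a bucket, groups by that column, and hands each whole bucket to a function as a pandas DataFrame, which is exactly where the memory limitation applies.

```python
from typing import Any, Dict, List


def add_bucket(row: Dict[str, Any]) -> Dict[str, Any]:
    """Tag a row with a coarse aspect-ratio bucket (thresholds are illustrative)."""
    ratio = row["width"] / row["height"]
    if ratio < 0.75:
        bucket = "portrait"
    elif ratio < 1.33:
        bucket = "square"
    else:
        bucket = "landscape"
    return {**row, "bucket": bucket}


def tag_group(group):
    """Receive one whole bucket as a pandas DataFrame and annotate it.

    The entire group is materialized here, so it must fit in memory.
    """
    return group.assign(bucket_size=len(group))


def run_with_ray() -> List[Dict[str, Any]]:
    """Run the sketch on a tiny in-memory dataset (requires Ray and pandas)."""
    import ray  # imported lazily so the helpers above work without Ray

    ds = ray.data.from_items(
        [
            {"width": 100, "height": 200},
            {"width": 200, "height": 100},
            {"width": 120, "height": 240},
        ]
    )
    ds = ds.map(add_bucket)
    grouped = ds.groupby("bucket").map_groups(tag_group, batch_format="pandas")
    return grouped.take_all()
```

Calling run_with_ray() should return the rows with their bucket and bucket_size columns attached. This groups rows by bucket, but it does not by itself guarantee that batches drawn later from the resulting dataset stay within a single bucket.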