Groupby performance issues with many small groups


I’m playing around with ray datasets and I’m noticing some seemingly weird behavior – would love to hear from an expert to see if this is expected.

My setup is that I have a dataset of 450k rows and I want to do some operations that involve EITHER a groupby->map_groups OR a sort->map_batches. In my dataset, the majority of the groups are of size 1 (>400k rows of size 1, with a maximum group size of 11).

When I try the two following operations, I see a massive disparity in runtimes:
ds.groupby('bucketdoc_hash').map_groups(lambda g: g).materialize()
ds.sort('bucketdoc_hash').map_batches(lambda b: b, batch_size=1000).materialize()

Where the former runs in ~90s on my macbook, and the latter runs in <1s. When I set the batch_size to 1, the latter still runs in ~30s.

Is this expected performance and something that is known? Or rather, am I doing anything not ray-ful with the groupby side of things?

(for reproducibility purposes, here’s a google drive link to the parquet file of the dataset I’m playing with: fd233f9209ad423b9a8bf76612ddc5f3_000000.parquet - Google Drive )

Your usage is correct. But the groupby API hasn’t been optimized for large-scale use cases. So it may perform worse.