Groupby performance issues with many small groups

mattjordan · October 23, 2023, 6:18pm

Hi,

I’m playing around with Ray Datasets and noticing some seemingly weird behavior. I’d love to hear from an expert whether this is expected.

My setup: I have a dataset of 450k rows and I want to run some operations that involve either a groupby->map_groups or a sort->map_batches. In my dataset, the vast majority of groups have size 1 (>400k groups of size 1, with a maximum group size of 11).

When I try the following two operations, I see a massive disparity in runtimes:
ds.groupby('bucketdoc_hash').map_groups(lambda g: g).materialize()
vs
ds.sort('bucketdoc_hash').map_batches(lambda b: b, batch_size=1000).materialize()

The former runs in ~90s on my MacBook, while the latter runs in <1s. When I set batch_size to 1, the latter still runs in ~30s.

Is this expected performance and something that is known? Or rather, am I doing anything not ray-ful with the groupby side of things?

(For reproducibility, here’s a Google Drive link to the parquet file of the dataset I’m playing with: fd233f9209ad423b9a8bf76612ddc5f3_000000.parquet - Google Drive)
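
For anyone who can't get the parquet file, here is a rough sketch of a synthetic stand-in for timing the two pipelines. The row count, key column name, and maximum group size come from the post above. The `value` column, the 90% singleton fraction, and the helper names `make_rows` and `time_both` are assumptions made for illustration, so absolute timings will not match the numbers quoted here.

```python
import random
import time


def make_rows(n_rows=450_000, singleton_fraction=0.9, max_group_size=11, seed=0):
    """Build rows whose 'bucketdoc_hash' groups are mostly of size 1.

    Synthetic stand-in for the real dataset; the size distribution is assumed.
    """
    rng = random.Random(seed)
    rows = []
    group_id = 0
    while len(rows) < n_rows:
        if rng.random() < singleton_fraction:
            size = 1
        else:
            size = rng.randint(2, max_group_size)
        size = min(size, n_rows - len(rows))  # never overshoot n_rows
        base = len(rows)
        for i in range(size):
            rows.append({"bucketdoc_hash": f"h{group_id:08d}", "value": base + i})
        group_id += 1
    rng.shuffle(rows)  # the real data is presumably not pre-sorted by key
    return rows


def time_both(rows, key="bucketdoc_hash", batch_size=1000):
    """Time groupby->map_groups against sort->map_batches on the same rows."""
    import ray  # imported lazily so make_rows works without Ray installed

    ds = ray.data.from_items(rows)

    start = time.perf_counter()
    ds.groupby(key).map_groups(lambda g: g).materialize()
    groupby_seconds = time.perf_counter() - start

    start = time.perf_counter()
    ds.sort(key).map_batches(lambda b: b, batch_size=batch_size).materialize()
    sort_seconds = time.perf_counter() - start

    return groupby_seconds, sort_seconds


# Example (requires Ray):
#   groupby_seconds, sort_seconds = time_both(make_rows())
```

Passing a smaller `n_rows` first is a cheap way to check that the gap grows with the number of groups rather than with the number of rows.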

raulchen · October 25, 2023, 10:11pm

Your usage is correct. However, the groupby API hasn’t been optimized for large-scale use cases, so it may perform worse.
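
Until that changes, one possible workaround is to lean on the sort->map_batches path measured in the original post and split each sorted batch into groups by hand. The sketch below is only an illustration of that idea, not an official recommendation: `group_runs` and `add_group_size` are hypothetical helpers, and the `group_size` column is an invented example of per-group work. Note the caveat that a group can straddle a batch boundary, so results are only per-batch unless you handle the edges yourself.

```python
from itertools import groupby


def group_runs(keys):
    """Yield (key, start, end) for each run of equal adjacent keys.

    Assumes the keys are already sorted, so each run is one group
    (or the part of a group that falls inside this batch).
    """
    start = 0
    for value, run in groupby(keys):
        end = start + sum(1 for _ in run)
        yield value, start, end
        start = end


def add_group_size(batch, key="bucketdoc_hash"):
    """Tag every row with the size of its group as seen within this batch."""
    sizes = []
    for _, start, end in group_runs(batch[key]):
        sizes.extend([end - start] * (end - start))
    out = dict(batch)  # shallow copy so the input batch is not mutated
    # Assumption: a plain list is accepted as a column value; wrap it in
    # numpy.asarray if your Ray version expects arrays.
    out["group_size"] = sizes
    return out


# Example (requires Ray):
#   ds.sort("bucketdoc_hash").map_batches(add_group_size, batch_size=1000).materialize()
```

With a maximum group size of 11 and a batch size of 1000, only a small fraction of groups can be cut by a batch boundary, but any logic that needs exact groups still has to account for them.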
