Chaining multiple AggregateFns and applying them to the same column

Keshi_Dai1 · March 20, 2023, 5:25pm

Is it possible to chain multiple AggregateFns and apply them to the same column or a set of columns? E.g. I would like to compute stats for columns “A” and “B” in a dataset. As it is now, I have to do the following for each aggregator and column:

data.aggregate(Max("A"), Min("A"), Mean("A"), Max("B"), Min("B"), Mean("B"))

Is there any easier way to define this? Under the hood, does Ray actually apply the optimization and only need one pass of the entire dataset?

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

bveeramani · March 20, 2023, 9:10pm

Hey @Keshi_Dai1, thanks for posting your questions.

Is there any easier way to define this?

Don’t think there’s an easier way to define this. Did you have a potential API in mind?

Under the hood, does Ray actually apply the optimization and only need one pass of the entire dataset?

Yeah, this is optimized. Ray doesn’t perform a pass for each aggregation function.
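
To illustrate what "one pass" means here, below is a rough sketch in plain Python of several statistics sharing a single loop over the rows. It is not Ray's actual implementation; the `RunningStats` class and `single_pass_stats` function are hypothetical names made up for this example.

```python
import math


class RunningStats:
    """Hypothetical accumulator: tracks max, min, and mean for one column."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.maximum = -math.inf
        self.minimum = math.inf

    def add(self, value):
        # Update every statistic from the same value, so no second scan is needed.
        self.count += 1
        self.total += value
        self.maximum = max(self.maximum, value)
        self.minimum = min(self.minimum, value)

    def result(self):
        mean = self.total / self.count if self.count else None
        return {"max": self.maximum, "min": self.minimum, "mean": mean}


def single_pass_stats(rows, columns):
    """Compute all statistics for all columns with one loop over the rows."""
    stats = {column: RunningStats() for column in columns}
    for row in rows:  # the only iteration over the data
        for column in columns:
            stats[column].add(row[column])
    return {column: acc.result() for column, acc in stats.items()}


rows = [{"A": 1, "B": 10}, {"A": 2, "B": 20}, {"A": 3, "B": 30}]
result = single_pass_stats(rows, ["A", "B"])
```

Each row is visited once and feeds every accumulator, which is the general idea behind combining many aggregations into one scan.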

Keshi_Dai1 · March 20, 2023, 10:04pm

Thanks @bveeramani!

I’m working on an API to compute stats for columns in a dataset. It’s not a big deal as long as it’s optimized under the hood. I’m able to get the columns via schema.

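from ray.data.aggregate import Max, Mean, Min, Std

def compute_feature_stats(features, columns=None):
    # Default to every column in the dataset's schema.
    if not columns:
        columns = features.schema().names

    # Build one aggregator per (column, statistic) pair.
    aggregators = []
    for column in columns:
        aggregators.append(Mean(column))
        aggregators.append(Std(column))
        aggregators.append(Max(column))
        aggregators.append(Min(column))
        ...
    # Submit all aggregators in a single aggregate() call.
    stats = features.aggregate(*aggregators)
    ...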
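
If the list of statistics is likely to grow, the nested appends could be generated from a list of aggregator classes instead. The following is a minimal sketch under that assumption; `build_aggregators`, `StubMax`, and `StubMin` are hypothetical names, and the stub classes only stand in for Ray's aggregators so the snippet runs without Ray installed.

```python
from itertools import product


def build_aggregators(columns, aggregator_classes):
    """Return one aggregator instance per (column, aggregator class) pair."""
    return [agg_cls(column) for column, agg_cls in product(columns, aggregator_classes)]


# Stand-ins for real aggregator classes (assumption: each takes the column
# name as its first positional argument, as in Max("A") above).
class _StubAggregator:
    def __init__(self, on):
        self.on = on

    def __repr__(self):
        return f"{type(self).__name__}({self.on!r})"


class StubMax(_StubAggregator):
    pass


class StubMin(_StubAggregator):
    pass


demo = build_aggregators(["A", "B"], [StubMax, StubMin])
names = [repr(agg) for agg in demo]

# With Ray Data, the call might look like this (not executed here):
#   from ray.data.aggregate import Max, Min, Mean
#   stats = ds.aggregate(*build_aggregators(["A", "B"], [Max, Min, Mean]))
```

The result is still passed to a single `aggregate(...)` call, so this only changes how the argument list is written, not how it is executed.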
