Is it possible to chain multiple AggregateFns and apply them to the same column or set of columns? For example, I would like to compute stats for columns “A” and “B” in a dataset. Currently, I have to list every aggregator for every column:
data.aggregate(Max("A"), Min("A"), Mean("A"), Max("B"), Min("B"), Mean("B"))
Is there any easier way to define this? Under the hood, does Ray actually apply the optimization and only need one pass of the entire dataset?
How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
Hey @Keshi_Dai1, thanks for posting your questions.
Is there any easier way to define this?
I don’t think there’s an easier way to define this. Did you have a potential API in mind?
Under the hood, does Ray actually apply the optimization and only need one pass of the entire dataset?
Yeah, this is optimized. Ray doesn’t perform a pass for each aggregation function.
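To illustrate why several aggregations do not need several passes, here is a minimal pure-Python sketch. It is not Ray's implementation; the function name `single_pass_stats` and the accumulator layout are hypothetical, chosen only to show the idea of updating every accumulator while each row is read once.

```python
import math

def single_pass_stats(rows, columns):
    """Compute min, max, and mean for several columns in one pass over rows."""
    # One accumulator per column; all of them are updated as each row streams by.
    acc = {
        c: {"count": 0, "sum": 0.0, "min": math.inf, "max": -math.inf}
        for c in columns
    }
    for row in rows:  # the data is iterated exactly once
        for c in columns:
            value = row[c]
            a = acc[c]
            a["count"] += 1
            a["sum"] += value
            a["min"] = min(a["min"], value)
            a["max"] = max(a["max"], value)
    # Finalize: derive the mean from the running sum and count.
    return {
        c: {
            "min": a["min"],
            "max": a["max"],
            "mean": a["sum"] / a["count"] if a["count"] else None,
        }
        for c, a in acc.items()
    }

rows = [{"A": 1, "B": 10}, {"A": 3, "B": 30}, {"A": 2, "B": 20}]
stats = single_pass_stats(rows, ["A", "B"])
```

Adding another statistic here means adding another field to the accumulator, not another loop over the data.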
Thanks @bveeramani!
I’m working on an API to compute stats for columns in a dataset. It’s not a big deal as long as it’s optimized under the hood. I’m able to get the columns via schema.
from ray.data.aggregate import Max, Mean, Min, Std

def compute_feature_stats(features, columns=None):
    # Default to every column in the dataset's schema.
    if not columns:
        columns = features.schema().names
    aggregators = []
    for column in columns:
        aggregators.append(Mean(column))
        aggregators.append(Std(column))
        aggregators.append(Max(column))
        aggregators.append(Min(column))
    ...
    # All aggregators go into a single aggregate() call.
    stats = features.aggregate(*aggregators)
    ...
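The same loop can be condensed into a small helper that pairs every aggregator type with every column. This is only a sketch: `cross_aggregators` is a hypothetical name, not part of Ray, and the stand-in factories below exist solely so the snippet runs without a Ray installation.

```python
def cross_aggregators(agg_types, columns):
    """Return one aggregator per (column, aggregator type) pair.

    agg_types: callables that take a column name, e.g. Ray's Max, Min, Mean.
    columns:   iterable of column names.
    """
    return [agg(column) for column in columns for agg in agg_types]

# Stand-ins for real aggregator classes (hypothetical, for illustration only).
def fake_max(column):
    return ("max", column)

def fake_min(column):
    return ("min", column)

aggregators = cross_aggregators([fake_max, fake_min], ["A", "B"])

# With Ray installed, the intended usage would look like this (assumption):
#   from ray.data.aggregate import Max, Min, Mean
#   stats = data.aggregate(*cross_aggregators([Max, Min, Mean], ["A", "B"]))
```

Because the helper only calls each aggregator type with a column name, it should work with any aggregator whose constructor accepts the column as its first argument.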