Groupby key with None value

PieReissyet · August 1, 2024, 3:07pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hello, I have been trying to do a pretty straightforward dataset manipulation, but I am failing to do it efficiently with the Ray API and would love some feedback / help on this task.
I have reduced my code to this snippet so that it is easy to reproduce:

import pandas as pd
import ray
from ray.data.aggregate import AggregateFn
from ray.data.preprocessors.imputer import SimpleImputer

# Collect every title of a group into a single list.
aggregate = AggregateFn(
    init=lambda column: [],
    # accumulate_row=check_accumulate,
    accumulate_row=lambda a, r: a + [r['title']],
    # merge=check_merge,
    merge=lambda a1, a2: a1 + a2,
    name='title_grp'
)
count_err = 0
data = {
    "category": ['A', 'A', 'A', 'B', 'B', 'B'],
    "sub_cat": ['a1', 'a1', 'a2', 'b1', None, 'b3'],  # None in a groupby key
    "title": [f'tit{i}' for i in range(6)],
    # "value": [1, 2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)
ray.data.from_pandas(df).groupby(['category', 'sub_cat']).aggregate(aggregate).take()

This crashes because there is a None value in the sub_cat column, which is one of the groupby keys. Is this expected? What workaround would you advise in this example?

I tried adding a preprocessor like so:

preprocessor = SimpleImputer(
    columns=["sub_cat"],
    strategy="constant",
    fill_value="")

But performance was extremely slow on a large volume of data (10M+).
What would you suggest for this use case?
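
For reference, one direction that could be tried instead of the imputer is to replace the missing key with a sentinel string per batch, before the groupby. The sketch below is only an assumption about what might help: the function name fill_missing_keys and the MISSING_KEY sentinel are made up for illustration, and whether this is any faster than SimpleImputer at 10M+ scale has not been measured. It runs on a plain pandas batch so it can be checked without a cluster.

```python
import pandas as pd

# Hypothetical sentinel: any string that cannot collide with a real sub_cat value.
MISSING_KEY = ""


def fill_missing_keys(batch: pd.DataFrame) -> pd.DataFrame:
    """Replace None/NaN in the groupby key column with a sentinel string."""
    batch = batch.copy()  # leave the caller's batch untouched
    batch["sub_cat"] = batch["sub_cat"].fillna(MISSING_KEY)
    return batch


# Local check on a plain pandas batch (no Ray needed).
sample = pd.DataFrame({
    "category": ["B", "B", "B"],
    "sub_cat": ["b1", None, "b3"],
    "title": ["tit3", "tit4", "tit5"],
})
filled = fill_missing_keys(sample)

# Assumed usage with Ray Data, applied before the groupby (not executed here):
# ds = ray.data.from_pandas(df).map_batches(fill_missing_keys, batch_format="pandas")
# ds.groupby(["category", "sub_cat"]).aggregate(aggregate).take()
```

With this approach the rows that had no sub_cat end up in their own group keyed by the sentinel, so the sentinel has to be a value that never appears as a real sub-category.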

Thanks for the great work
