Groupby key with None value

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello, I have been trying to do a pretty straightforward dataset manipulation, but I am failing to do it efficiently with the Ray API and would love some feedback / help on this task.
I have reduced my code to this snippet so that it is easily reproducible:

import pandas as pd
import ray
from ray.data.aggregate import AggregateFn
from ray.data.preprocessors.imputer import SimpleImputer

# Collect every title of a group into a single list.
aggregate = AggregateFn(
    init=lambda column: [],
    # accumulate_row=check_accumulate,
    accumulate_row=lambda a, r: a + [r['title']],
    # merge=check_merge,
    merge=lambda a1, a2: a1 + a2,
    name='title_grp'
)
count_err = 0
data = {
    "category": ['A', 'A', 'A', 'B', 'B', 'B'],
    "sub_cat": ['a1', 'a1', 'a2', 'b1', None, 'b3'],  # note the None key
    "title": [f'tit{i}' for i in range(6)],
    # "value": [1, 2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)
ray.data.from_pandas(df).groupby(['category', 'sub_cat']).aggregate(aggregate).take()

This crashes because there is a None value in the sub_cat column, which is one of the groupby keys. Is this expected? What workaround would you advise in this example?
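
To make the expected result concrete, here is a minimal plain-Python sketch (no Ray involved) of the grouping I am after. It assumes the missing key can simply be replaced by a placeholder. The `MISSING` sentinel and the `group_titles` helper are hypothetical names used only for illustration, not part of any Ray API.

```python
from collections import defaultdict

# Hypothetical placeholder used in place of a missing group key.
MISSING = ""

def group_titles(rows, keys=("category", "sub_cat"), value="title"):
    """Collect `value` per key tuple, mapping None key parts to MISSING."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(MISSING if row[k] is None else row[k] for k in keys)
        groups[key].append(row[value])
    return dict(groups)

# Same toy data as in the reproduction above, expressed as a list of rows.
rows = [
    {"category": c, "sub_cat": s, "title": f"tit{i}"}
    for i, (c, s) in enumerate(
        zip("AAABBB", ["a1", "a1", "a2", "b1", None, "b3"])
    )
]
result = group_titles(rows)
```

Here the row with the None sub_cat ends up in its own group keyed by `("B", "")` instead of making the whole groupby fail, which is the behaviour I would like to get from Ray at scale.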

I tried adding a preprocessor like so:

preprocessor = SimpleImputer(
    columns=["sub_cat"],
    strategy="constant",
    fill_value="")

But performance was very slow on a large volume of data (10M+).
What would you suggest for this use case?

Thanks for the great work