Dataset statistics best practice

eifuentes · January 12, 2023, 4:32pm

How severely does this issue affect your experience of using Ray?

Medium: It contributes significant difficulty to completing my task, but I can work around it.

Cross-posting from the Slack community.

Hey all, new to Ray Datasets. Any advice on how to do holistic operations using map_batches() or map_groups()?

Use cases that come to mind are counting for histograms and percentile calculations across the entire dataset.

Seems like Ray Datasets is set up well for map-reduce operations, but I'm curious about best practices and approaches.

Thank you!

Emmanuel

sjl · January 12, 2023, 9:41pm

Hi Emmanuel, welcome to using Ray Datasets!
To start, we have the groupby API, which supports custom aggregation.
Do you have more details/examples on what you want to achieve with histogram/percentile functions?

Scott
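
For readers landing on this thread: the histogram and percentile use cases above fit an init / accumulate / merge shape, which is the general pattern behind custom aggregations. The sketch below is plain Python and does not call any Ray API. The names `EDGES`, `init`, `accumulate`, `merge`, and `approx_percentile` are hypothetical, and the fixed bin edges and sample batches are assumptions made only for illustration. Each batch is folded into a partial histogram, the partials are summed, and a percentile is then estimated from the merged counts.

```python
from bisect import bisect_right
from functools import reduce

# Assumed bin edges. Every batch must share the same edges so that
# partial counts can be added together. The last bin is closed: [40, 50].
EDGES = [0, 10, 20, 30, 40, 50]


def init():
    """Empty accumulator: one zero count per bin."""
    return [0] * (len(EDGES) - 1)


def accumulate(counts, batch):
    """Map step: fold one batch of values into a copy of the accumulator."""
    out = list(counts)
    for v in batch:
        if EDGES[0] <= v <= EDGES[-1]:  # values outside the range are ignored
            i = min(bisect_right(EDGES, v) - 1, len(out) - 1)
            out[i] += 1
    return out


def merge(a, b):
    """Reduce step: partial histograms combine by element-wise addition."""
    return [x + y for x, y in zip(a, b)]


def approx_percentile(counts, q):
    """Estimate the q-th percentile (0-100) by interpolating inside a bin."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q / 100 * total
    seen = 0
    for i, c in enumerate(counts):
        if c and seen + c >= target:
            frac = (target - seen) / c
            return EDGES[i] + frac * (EDGES[i + 1] - EDGES[i])
        seen += c
    return EDGES[-1]


# Hypothetical batches standing in for blocks of a larger dataset.
batches = [[1, 12, 25], [33, 47, 8], [15, 22, 41, 50]]
partials = [accumulate(init(), b) for b in batches]
histogram = reduce(merge, partials, init())
median = approx_percentile(histogram, 50)
```

The percentile here is an approximation whose accuracy depends on the bin width. An exact percentile needs a global sort or a second pass over the data, so the histogram route is a trade of precision for a single map-reduce pass.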

eifuentes · January 14, 2023, 12:31am

thanks Scott, I totally missed this in the documentation. Ty!
