Dataset statistics best practice

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

Cross-posting from the Slack community.

Hey all, new to Ray Datasets. Any advice on how to do holistic operations using map_batches() or map_groups()?

Use cases that come to mind are counting for histograms and percentile calculations across the entire dataset.

Seems like Ray Datasets is set up well for map-reduce operations, but I'm curious about best practices and approaches.

Thank you!


Hi Emmanuel, welcome to using Ray Datasets!
As a starting point, we have the groupby API, which supports custom aggregation.
Do you have more details/examples on what you want to achieve with histogram/percentile functions?
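
In the meantime, here is a rough, library-free sketch of the map-reduce shape that histogram and approximate-percentile calculations usually take. It is only an illustration under assumptions: the bin edges, function names, and sample batches are all hypothetical, and the percentile is an estimate limited by bin width, not an exact value. With Ray Datasets, the per-batch step would presumably run inside map_batches() and the merge step would be a custom aggregation; that wiring is left out here.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical fixed bin edges shared by every batch: [0, 10), [10, 20), ..., [90, 100]
EDGES = list(range(0, 101, 10))


def partial_histogram(batch, edges=EDGES):
    """Map step: count the values of one batch into the fixed bins."""
    counts = [0] * (len(edges) - 1)
    for value in batch:
        # Clamp so that out-of-range values land in the first or last bin.
        idx = min(max(bisect_right(edges, value) - 1, 0), len(counts) - 1)
        counts[idx] += 1
    return counts


def merge_histograms(a, b):
    """Reduce step: histograms over the same edges merge by element-wise sum."""
    return [x + y for x, y in zip(a, b)]


def approx_percentile(counts, q, edges=EDGES):
    """Upper edge of the first bin whose cumulative count reaches q percent."""
    target = q / 100 * sum(counts)
    for upper, cumulative in zip(edges[1:], accumulate(counts)):
        if cumulative >= target:
            return upper
    return edges[-1]


# Made-up batches standing in for blocks of a larger dataset.
batches = [[1, 5, 12, 33], [47, 52, 58, 91], [15, 64, 77, 99]]

total = [0] * (len(EDGES) - 1)
for batch in batches:
    total = merge_histograms(total, partial_histogram(batch))

median_estimate = approx_percentile(total, 50)
```

The key point is that every batch uses the same bin edges, so partial results can be summed in any order. Narrower bins give tighter percentile estimates at the cost of larger partial results.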


thanks Scott, I totally missed this in the documentation. Ty!