About the Ray Data category

For questions on large-scale data loading, preprocessing, and batch transformations in distributed pipelines. Ray Datasets are the standard way to load and exchange data in Ray applications and provide transformations such as map, filter, and repartition.

Ray Data is a scalable data processing library for ML and AI workloads built on Ray. It provides flexible and performant APIs for expressing AI workloads such as batch inference, data preprocessing, and ingest for ML training. Unlike many other distributed data systems, Ray Data uses a streaming execution model to efficiently process large datasets and maintain high utilization across both CPU and GPU workloads.
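
As a rough illustration of the batch inference pattern described above, here is a minimal sketch. The `LinearModel` class, the column names, and the `concurrency` argument (available in recent Ray 2.x releases) are assumptions for the example, not an official recipe:

```python
import numpy as np
import ray

# A 1,000 x 4 feature matrix; from_numpy yields a Dataset with a "data" column.
ds = ray.data.from_numpy(np.random.rand(1000, 4))

class LinearModel:
    """Stateful callable: the 'model' is constructed once per worker, not per batch."""

    def __init__(self):
        # Hypothetical weights; in practice, load a real model checkpoint here.
        self.weights = np.array([0.25, 0.25, 0.25, 0.25])

    def __call__(self, batch):
        # batch is a dict of NumPy arrays when batch_format="numpy".
        batch["prediction"] = batch["data"] @ self.weights
        return batch

# map_batches streams batches through a small pool of actor workers;
# for GPU inference you would also request GPUs per worker.
preds = ds.map_batches(LinearModel, concurrency=2, batch_format="numpy")
print(preds.take(2))
```

Because execution is streaming, batches flow through the pipeline as they are produced rather than waiting for the whole dataset to materialize.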

Documentation:

Ray Data is the tool of choice for large-scale data loading, preprocessing, and batch transformations in distributed AI and ML pipelines. It builds on Ray's distributed execution framework to process data efficiently for training, inference, and preprocessing, and it offers transformations such as map, filter, map_batches, and repartition for flexible data manipulation at scale. Its streaming execution model keeps data moving continuously across CPU and GPU workloads, so throughput and utilization stay high even on very large datasets. Whether you are preparing batch data for training or running batch inference, Ray Data is designed to scale without sacrificing performance.
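
For concreteness, a minimal sketch of those transformations, assuming a recent Ray 2.x release where `ray.data.range()` produces an `id` column:

```python
import ray

# Build a small Dataset; in practice this would be ray.data.read_parquet(...)
# or another reader over local or cloud storage.
ds = ray.data.range(10_000)

# Row-level transformations: keep even ids, then derive a new column.
ds = ds.filter(lambda row: row["id"] % 2 == 0)
ds = ds.map(lambda row: {"id": row["id"], "square": row["id"] ** 2})

# Vectorized, batch-level transformation (the typical preprocessing path).
def add_offset(batch):
    batch["square"] = batch["square"] + 1
    return batch

ds = ds.map_batches(add_offset, batch_format="numpy")

# Redistribute into a fixed number of blocks before writing or training ingest.
ds = ds.repartition(8)

print(ds.take(3))
```

Transformations are lazy; execution streams through the pipeline when results are consumed, for example by `take()`, a write call, or a training ingest loop.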