Dataset Pipelines - Window deprecated?

calonsca · August 28, 2024, 6:06pm

Hi, a few months ago I bought the book Learning Ray: Flexible Distributed Python for Machine Learning by Max Pumperla, Edward Oakes & Richard Liaw. The book uses Ray 2.2.0.

The book has a section on Datasets describing a feature that makes operations non-blocking, so that subsequent steps start as soon as data is available. It is called DatasetPipeline and is created with a function called window, for example window(blocks_per_window=5).

While working through the book's examples with the latest version of Ray, I noticed that this function (window) no longer exists.

For example, with the following code, gpu_intensive_inference does not start until all the data has been processed by cpu_intensive_preprocessing, so there is a period during which the GPU is idle.

ds = (ray.data.read_parquet("s3://my_bucket/input_data")
      .map(cpu_intensive_preprocessing)
      .map_batches(gpu_intensive_inference, compute="actors", num_gpus=1)
      .repartition(10))

As described in the book, to keep resources from sitting idle you call window (for example window(blocks_per_window=5)), which processes the data a few blocks at a time. As soon as the first blocks are ready, gpu_intensive_inference can start working on them, like this:

ds = (ray.data.read_parquet("s3://my_bucket/input_data")
      .window(blocks_per_window=5)
      .map(cpu_intensive_preprocessing)
      .map_batches(gpu_intensive_inference, compute="actors", num_gpus=1)
      .repartition(10))

In newer versions of Ray, is this no longer necessary because it is done automatically?

Thanks

sjl · August 28, 2024, 11:53pm

Yes, that’s correct. DatasetPipeline is deprecated because Ray Data now uses streaming execution by default (in effect, it always behaves like a DatasetPipeline). window() is deprecated as part of the same change. You can read more details on the Ray Data Internals page of the Ray 2.34.0 documentation.

calonsca · August 29, 2024, 7:11am

Thank you very much for the quick reply!
I guess some things in the book will be a bit outdated by now, so I think I’d better read the official documentation on the web.
