Dataset Pipelines - Window deprecated?

Hi, a few months ago I bought the book Learning Ray: Flexible Distributed Python for Machine Learning by Max Pumperla, Edward Oakes, and Richard Liaw. The book uses Ray 2.2.0.

The book has a section on Datasets that describes a feature for making operations non-blocking, so that subsequent steps start as soon as data is available. It is called DatasetPipeline and is created with a function called window, for example window(blocks_per_window=5).

While trying the book's examples with the latest version of Ray, I realized that this function (window) no longer exists.

For example, in the following code, gpu_intensive_inference does not start until all of the data has been processed by cpu_intensive_preprocessing, so the GPU sits idle for a while.

ds = (ray.data.read_parquet("s3://my_bucket/input_data")
      .map(cpu_intensive_preprocessing)
      .map_batches(gpu_intensive_inference, compute="actors", num_gpus=1)
      .repartition(10))

As described in the book, to avoid idle resources you call window (for example window(blocks_per_window=5)), which processes the data in windows of blocks. As soon as the first blocks are ready, gpu_intensive_inference can start working on them, like this:

ds = (ray.data.read_parquet("s3://my_bucket/input_data")
      .window(blocks_per_window=5)
      .map(cpu_intensive_preprocessing)
      .map_batches(gpu_intensive_inference, compute="actors", num_gpus=1)
      .repartition(10))

Is this no longer necessary in newer versions of Ray because it is done automatically?

Thanks

Yes, that’s correct. DatasetPipeline is deprecated because Ray Data now uses streaming execution by default, which behaves much like always using a DatasetPipeline. window() is deprecated as part of this change. You can read more details here: Ray Data Internals (Ray 2.34.0 documentation)
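
To make the difference concrete, here is a minimal sketch in plain Python generators. It does not use Ray's API; the names cpu_stage, gpu_stage, run_bulk, and run_streaming are hypothetical stand-ins for the two functions in your example. It also runs in a single thread, whereas Ray runs the stages concurrently, so it only illustrates the order in which blocks reach each stage.

```python
from typing import Iterable, Iterator, List


def cpu_stage(blocks: Iterable[int], log: List[str]) -> Iterator[int]:
    """Stand-in for cpu_intensive_preprocessing: doubles each block."""
    for block in blocks:
        log.append(f"cpu:{block}")
        yield block * 2


def gpu_stage(blocks: Iterable[int], log: List[str]) -> Iterator[int]:
    """Stand-in for gpu_intensive_inference: adds one to each block."""
    for block in blocks:
        log.append(f"gpu:{block}")
        yield block + 1


def run_bulk(blocks: List[int], log: List[str]) -> List[int]:
    # Bulk execution: the whole CPU stage finishes before the GPU stage starts.
    preprocessed = list(cpu_stage(blocks, log))
    return list(gpu_stage(preprocessed, log))


def run_streaming(blocks: List[int], log: List[str]) -> List[int]:
    # Streaming execution: each block is handed to the GPU stage as soon as
    # the CPU stage has produced it.
    return list(gpu_stage(cpu_stage(blocks, log), log))


bulk_log: List[str] = []
streaming_log: List[str] = []
bulk_result = run_bulk([1, 2, 3], bulk_log)
streaming_result = run_streaming([1, 2, 3], streaming_log)

print(bulk_log)       # all cpu entries first, then all gpu entries
print(streaming_log)  # cpu and gpu entries interleaved per block
```

Both variants produce the same results; only the order of work changes. For your pipeline, that means you can drop the .window(blocks_per_window=5) call and keep the rest of the chain.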


Thank you very much for the quick reply!
I guess some things in the book will be a bit outdated by now, so I think I’d better read the official documentation on the web.