Hi, a few months ago I bought the book Learning Ray: Flexible Distributed Python for Machine Learning by Max Pumperla, Edward Oakes & Richard Liaw. The book uses Ray 2.2.0.
The book has a section on Datasets that describes a feature for making operations non-blocking, so that subsequent steps start as soon as the data they need is available. It is called DatasetPipeline, and you create one with a function called window, for example window(blocks_per_window=5).
While trying the book's examples with the latest version of Ray, I noticed that this function (window) no longer exists.
For example, in the following code, gpu_intensive_inference does not start until all the data has been processed by cpu_intensive_preprocessing, so there is a period during which the GPU is idle.
ds = (ray.data.read_parquet("s3://my_bucket/input_data")
      .map(cpu_intensive_preprocessing)
      .map_batches(gpu_intensive_inference, compute="actors", num_gpus=1)
      .repartition(10))
As described in the book, to avoid leaving resources idle you call window (for example window(blocks_per_window=5)), which splits the dataset into windows of blocks. As soon as the first window has been preprocessed, gpu_intensive_inference can start working on it, like this:
ds = (ray.data.read_parquet("s3://my_bucket/input_data")
      .window(blocks_per_window=5)
      .map(cpu_intensive_preprocessing)
      .map_batches(gpu_intensive_inference, compute="actors", num_gpus=1)
      .repartition(10))
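To make clear what I mean by windowing, here is a minimal pure-Python sketch of the idea as I understand it. It is not Ray code: the helper names windows and run_pipeline are made up for illustration, and the two stages are trivial arithmetic stand-ins for cpu_intensive_preprocessing and gpu_intensive_inference.

```python
from itertools import islice


def windows(blocks, blocks_per_window):
    """Yield consecutive lists of at most blocks_per_window blocks."""
    it = iter(blocks)
    while True:
        window = list(islice(it, blocks_per_window))
        if not window:
            return
        yield window


def run_pipeline(blocks, blocks_per_window, log):
    """Run both stages window by window, recording the order of events in log."""
    results = []
    for i, window in enumerate(windows(blocks, blocks_per_window)):
        log.append(f"preprocess window {i}")
        # Stand-in for cpu_intensive_preprocessing (hypothetical arithmetic).
        preprocessed = [b * 2 for b in window]
        log.append(f"inference window {i}")
        # Stand-in for gpu_intensive_inference (hypothetical arithmetic).
        results.extend(b + 1 for b in preprocessed)
    return results


log = []
out = run_pipeline(range(12), blocks_per_window=5, log=log)
print(log)
```

The log shows inference on window 0 happening before preprocessing of window 1, instead of all preprocessing finishing first. This sketch runs sequentially and only shows the ordering; in a real pipeline I would expect the stages for different windows to also overlap in time.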
Is this no longer necessary in newer versions of Ray because it is done automatically?
Thanks