Keep PyTorch DataLoader when using Ray Data

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I wonder whether there is a way to use Ray Data together with my current PyTorch DataLoader.

Here is my current code.

for inputs in dataloader:        # batches come from my PyTorch DataLoader
    outputs = model(inputs)      # GPU inference
    postprocess_on_cpu(outputs)  # CPU post-processing

Now I hope to decouple inference and post-processing to fully utilize both CPU and GPU resources. I guess my code could be rewritten with Ray Data roughly as follows.

ds = ray.data.read_parquet(data_uri)
ds = ds.map(preprocess_on_cpu)
# map_batches with an actor pool expects a callable class that wraps the model
# (one replica per actor), rather than the model instance itself
ds = ds.map_batches(model, compute=ray.data.ActorPoolStrategy(size=8),
                    num_gpus=1, batch_size=1024)
ds = ds.map(postprocess_on_cpu)
# then pull data from ds
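By "pull data from ds" I mean consuming the results on the driver, something like the sketch below (I assume iter_batches is the right API for this, and handle_results is just a placeholder for whatever I do with the output):

for batch in ds.iter_batches(batch_size=1024):
    # batch is already post-processed here; write it out or aggregate metrics
    handle_results(batch)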

The problem is migrating the data loading logic from the PyTorch DataLoader to Ray Data. The Ray Data docs say:

Any logic for reading data from cloud storage and disk can be replaced by one of the Ray Data read_* APIs, and any transformation logic can be applied as a map call on the Dataset.

However, that may take some effort when the data loading logic is complicated, and I don't want to risk degrading data loading performance.
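To make "complicated" a bit more concrete, my current loading setup looks roughly like this (a simplified sketch; MyDataset and my_collate stand in for the real custom code I would rather not rewrite):

from torch.utils.data import DataLoader

# Simplified sketch of the existing setup; MyDataset and my_collate stand in
# for non-trivial custom reading/decoding/batching logic.
dataloader = DataLoader(
    MyDataset(data_uri),      # custom Dataset with its own reading logic
    batch_size=1024,
    num_workers=8,            # parallel CPU workers for loading
    collate_fn=my_collate,    # custom batching logic
    pin_memory=True,
)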

It seems that a ray.data.Dataset is always constructed from some data source (for example, ray.data.read_parquet). Is there any way to use Ray Data with my current PyTorch DataLoader, or do I need to drop down to more primitive Ray APIs (for example, managing tasks and actors directly) to decouple inference and post-processing?
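For reference, this is the kind of "more primitive" approach I have in mind if Ray Data cannot consume the DataLoader directly: keep the DataLoader as-is and overlap GPU inference with CPU post-processing by launching Ray tasks (just a sketch, reusing the names from the code above):

import ray

ray.init()

@ray.remote
def postprocess_task(outputs):
    # CPU post-processing runs in a separate Ray worker process,
    # so it can overlap with the next inference step on the GPU
    return postprocess_on_cpu(outputs)

futures = []
for inputs in dataloader:            # unchanged PyTorch DataLoader
    outputs = model(inputs)          # GPU inference on the driver
    futures.append(postprocess_task.remote(outputs.cpu()))

results = ray.get(futures)           # collect the post-processed batches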