Keep PyTorch DataLoader when using Ray Data

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I wonder whether there is a way to use Ray Data together with my current PyTorch DataLoader.

Here is my current code.

for inputs in dataloader:        # batches come from my PyTorch DataLoader
    outputs = model(inputs)      # GPU inference
    postprocess_on_cpu(outputs)  # CPU post-processing

Now I hope to decouple inference and post-processing to fully utilize both CPU and GPU resources. I guess my code could be rewritten with Ray Data roughly as follows.

ds = ray.data.read_parquet(data_uri)
ds = ds.map(preprocess_on_cpu)
# map_batches with an actor pool expects a callable class that wraps the model
# (one replica per actor), rather than the model instance itself
ds = ds.map_batches(model, compute=ray.data.ActorPoolStrategy(size=8),
                    num_gpus=1, batch_size=1024)
ds = ds.map(postprocess_on_cpu)
# then pull data from ds
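By "pull data from ds" I mean consuming the results on the driver, something like the sketch below (I assume iter_batches is the right API for this, and handle_results is just a placeholder for whatever I do with the output):

for batch in ds.iter_batches(batch_size=1024):
    # batch is already post-processed here; write it out or aggregate metrics
    handle_results(batch)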

The problem is migrating the data loading logic from the PyTorch DataLoader to Ray Data. The Ray Data docs say:

Any logic for reading data from cloud storage and disk can be replaced by one of the Ray Data read_* APIs, and any transformation logic can be applied as a map call on the Dataset.

However, that may take some effort when the data loading logic is complicated, and I don't want to risk degrading data loading performance.
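To make "complicated" a bit more concrete, my current loading setup looks roughly like this (a simplified sketch; MyDataset and my_collate stand in for the real custom code I would rather not rewrite):

from torch.utils.data import DataLoader

# Simplified sketch of the existing setup; MyDataset and my_collate stand in
# for non-trivial custom reading/decoding/batching logic.
dataloader = DataLoader(
    MyDataset(data_uri),      # custom Dataset with its own reading logic
    batch_size=1024,
    num_workers=8,            # parallel CPU workers for loading
    collate_fn=my_collate,    # custom batching logic
    pin_memory=True,
)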

It seems that a ray.data.Dataset is always constructed from some data source (for example, ray.data.read_parquet). Is there any way to use Ray Data with my current PyTorch DataLoader, or do I need to drop down to more primitive Ray APIs (for example, managing tasks and actors directly) to decouple inference and post-processing?
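For reference, this is the kind of "more primitive" approach I have in mind if Ray Data cannot consume the DataLoader directly: keep the DataLoader as-is and overlap GPU inference with CPU post-processing by launching Ray tasks (just a sketch, reusing the names from the code above):

import ray

ray.init()

@ray.remote
def postprocess_task(outputs):
    # CPU post-processing runs in a separate Ray worker process,
    # so it can overlap with the next inference step on the GPU
    return postprocess_on_cpu(outputs)

futures = []
for inputs in dataloader:            # unchanged PyTorch DataLoader
    outputs = model(inputs)          # GPU inference on the driver
    futures.append(postprocess_task.remote(outputs.cpu()))

results = ray.get(futures)           # collect the post-processed batches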