How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I wonder whether there is a way to use Ray Data together with my current PyTorch DataLoader.
Here is my current code:
```python
for inputs in dataloader:
    outputs = model(inputs)  # inference
    postprocess_on_cpu(outputs)
```
Now I would like to decouple inference and post-processing to fully utilize CPU/GPU resources. I guess my code could be rewritten with Ray Data as follows:
```python
ds = ray.data.read_parquet(data_uri)
ds = ds.map(preprocess_on_cpu)
ds = ds.map_batches(
    model,
    compute=ray.data.ActorPoolStrategy(size=8),
    num_gpus=1,
    batch_size=1024,
)
ds = ds.map(postprocess_on_cpu)
# then pull data from ds
```
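To make concrete what I mean by "decouple", here is a plain-Python sketch of the pattern I'm after, with threads and a bounded queue standing in for Ray tasks/actors (`model` and `postprocess_on_cpu` are dummy stand-ins, not my real functions):

```python
import queue
import threading

def model(batch):
    # stand-in for GPU inference
    return [x + 1 for x in batch]

def postprocess_on_cpu(outputs):
    # stand-in for CPU post-processing
    return [x * 10 for x in outputs]

def run_pipeline(batches):
    q = queue.Queue(maxsize=4)  # bounded queue provides backpressure
    results = []

    def postprocess_worker():
        while True:
            outputs = q.get()
            if outputs is None:  # sentinel: no more batches
                break
            results.append(postprocess_on_cpu(outputs))

    worker = threading.Thread(target=postprocess_worker)
    worker.start()
    for batch in batches:
        # the "GPU" stage keeps producing while the worker
        # post-processes earlier batches on the CPU
        q.put(model(batch))
    q.put(None)
    worker.join()
    return results

print(run_pipeline([[1, 2], [3, 4]]))  # [[20, 30], [40, 50]]
```

This overlaps the two stages, which is exactly what I hope Ray Data would give me with proper scheduling and scaling instead of hand-rolled threads.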
The problem is migrating the data loading logic from the PyTorch DataLoader to Ray Data. The Ray Data docs say:
> Any logic for reading data from cloud storage and disk can be replaced by one of the Ray Data `read_*` APIs, and any transformation logic can be applied as a `map` call on the Dataset.
However, that migration may take considerable effort when the data loading logic is complicated, and I don't want to risk degrading data loading performance.
It seems that a `ray.data.Dataset` is always constructed from some data source (for example, `ray.data.read_parquet`). Is there any way to use Ray Data with my current PyTorch DataLoader, or do I need to drop down to more primitive Ray APIs (for example, manipulating tasks and actors directly) to decouple inference and post-processing?