How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I wonder whether there is a way to use Ray Data together with my current PyTorch DataLoader. Here is my current code:
```python
for inputs in dataloader:        # a DataLoader is iterable; next() was a bug
    outputs = model(inputs)      # GPU inference
    postprocess_on_cpu(outputs)  # CPU post-processing, currently blocking the GPU
```
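To make the goal concrete: what I mean by decoupling could be sketched in plain Python with a background thread and a bounded queue (the `model` and `postprocess_on_cpu` functions below are toy stand-ins, not my real code):

```python
import queue
import threading

def model(x):                 # toy stand-in for GPU inference
    return x * 2

def postprocess_on_cpu(y):    # toy stand-in for CPU post-processing
    return y + 1

results = []
q = queue.Queue(maxsize=8)    # bounded queue provides backpressure

def consumer():
    while True:
        y = q.get()
        if y is None:         # sentinel: producer is done
            break
        results.append(postprocess_on_cpu(y))

t = threading.Thread(target=consumer)
t.start()

for inputs in range(4):       # stand-in for the DataLoader loop
    q.put(model(inputs))      # inference and post-processing now overlap

q.put(None)                   # signal the consumer to stop
t.join()
print(results)                # [1, 3, 5, 7]
```

This overlaps the two stages, but it stays on one machine and one Python process, which is why I am looking at Ray Data instead.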
Now I hope to decouple inference and post-processing to fully utilize both CPU and GPU resources. I guess my code can be rewritten with Ray Data as follows:
```python
ds = ray.data.read_parquet(data_uri)
ds = ds.map(preprocess_on_cpu)      # CPU preprocessing
ds = ds.map_batches(                # GPU inference on an actor pool
    model,
    compute=ray.data.ActorPoolStrategy(size=8),
    num_gpus=1,
    batch_size=1024,
)
ds = ds.map(postprocess_on_cpu)     # CPU post-processing
# then pull data from ds
```
The problem is migrating the data-loading logic from the PyTorch DataLoader to Ray Data. The Ray Data docs say:

> Any logic for reading data from cloud storage and disk can be replaced by one of the Ray Data `read_*` APIs, and any transformation logic can be applied as a `map` call on the Dataset.
However, this may take considerable effort when the data-loading logic is complicated, and I don't want to risk degrading data-loading performance.
It seems that a `ray.data.Dataset` is always constructed from some data source (for example, `ray.data.read_parquet`). Is there any way to use Ray Data with my current PyTorch DataLoader, or do I need to drop down to more primitive Ray APIs (for example, manipulating tasks and actors directly) to decouple inference and post-processing?