Where to apply data augmentations when using trainer?

Vedant_Roy · October 20, 2022, 7:32pm

I have some very CPU-intensive data augmentations. When using torchdata, I would just add these as a map step.

However, I’m not sure where to add these in the trainer's preprocessing pipeline.
From reading the source code, I see comments like:

                    # If the window size is infinity, the preprocessor is cached and
                    # we don't need to re-apply it each time.

which seems to imply that in non-streaming mode the preprocessor is applied only once.
That would not be useful if I wanted to re-augment the data every epoch.

Is the solution just to use streaming mode?

bveeramani · October 27, 2022, 4:53am

Hey Vedant, thanks for reaching out!

Looks like your question was already answered on Slack. I’m posting the answer in case anyone else discovers this thread.

If you set use_streaming_api to True and specify a finite stream_window_size, then preprocessing operations are applied every epoch.
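
To make the difference concrete, here is a small library-free Python sketch. It does not use Ray and is not how the trainer is implemented; `augment`, `cached_epochs`, and `windowed_epochs` are hypothetical names that only mimic the behavior described above: with caching, the random augmentation runs once and every epoch sees the same values, while with a finite window it is re-applied on each pass.

```python
import random


def augment(x, rng):
    """Hypothetical random augmentation: add a small amount of noise."""
    return x + rng.uniform(-0.1, 0.1)


def cached_epochs(data, num_epochs, rng):
    """Mimics the cached case: augment once, reuse the result each epoch."""
    cached = [augment(x, rng) for x in data]
    for _ in range(num_epochs):
        yield list(cached)


def windowed_epochs(data, num_epochs, window_size, rng):
    """Mimics the finite-window case: augment each window again every epoch."""
    for _ in range(num_epochs):
        epoch = []
        for start in range(0, len(data), window_size):
            window = data[start:start + window_size]
            epoch.extend(augment(x, rng) for x in window)
        yield epoch


data = [1.0, 2.0, 3.0, 4.0]
cached = list(cached_epochs(data, num_epochs=2, rng=random.Random(0)))
windowed = list(windowed_epochs(data, num_epochs=2, window_size=2, rng=random.Random(0)))

print("cached epochs identical:  ", cached[0] == cached[1])
print("windowed epochs identical:", windowed[0] == windowed[1])
```

In the cached variant the CPU cost of the augmentation is paid once, but the model sees the same augmented samples every epoch. The windowed variant pays that cost on every epoch in exchange for fresh augmentations.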
