Is this sample code correct?

In the doc Computer Vision — Ray 2.6.3, a CV sample is given to show how AIR works for computer vision. Here is a code snippet:


dataset = per_epoch_preprocessor.transform(dataset)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32, "lr": 0.02, "epochs": 1},
    datasets={"train": dataset},
    scaling_config=ScalingConfig(num_workers=2),
    preprocessor=preprocessor,
)
results = trainer.fit()


The two preprocessors seem to be swapped, and I think the correct version should look like this:


dataset = preprocessor.transform(dataset)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32, "lr": 0.02, "epochs": 1},
    datasets={"train": dataset},
    scaling_config=ScalingConfig(num_workers=2),
    preprocessor=per_epoch_preprocessor,
)
results = trainer.fit()

Let me know if I have this right, and please feel free to correct me if I am wrong.

Hi @Li_Bin,

The integration of dataset preprocessing with Ray Data as the data ingest for Ray Train has been reworked in Ray 2.7. This user guide is now using outdated APIs, as you're no longer able to pass a preprocessor into a Trainer. Here's a quick rundown of how you should do global + per-epoch preprocessing:

preprocessor in this example is meant to be the “global” preprocessor that gets applied before training.

per_epoch_preprocessor is meant to be applied on the fly, as we iterate through the dataset, and may contain some random operations.
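For concreteness, the two preprocessors could be built something like this (just a sketch, assuming torchvision transforms and ray.data.preprocessors.TorchVisionPreprocessor; your columns and transforms will differ):

from torchvision import transforms
from ray.data.preprocessors import TorchVisionPreprocessor

# Deterministic transforms -> the "global" preprocessor
global_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224), antialias=True),
])
preprocessor = TorchVisionPreprocessor(columns=["image"], transform=global_transform)

# Random augmentations -> the per-epoch preprocessor
random_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
])
per_epoch_preprocessor = TorchVisionPreprocessor(columns=["image"], transform=random_transform)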

Because Ray Data now executes the stages in a streaming fashion, both these preprocessors get applied on the fly, as data gets read by your Ray Train script. So there is not really a distinction between the “global” and per-epoch preprocessor anymore, unless you explicitly materialize the dataset at some stage.

dataset = ...

global_preprocessor = TorchVisionPreprocessor(...)
dataset = global_preprocessor.transform(dataset)

# Optional: if you want to cache the outputs after global preprocessing in object store memory
dataset = dataset.materialize()

# If we materialized at the previous step, only the per_epoch_preprocessor
# logic will get run on the fly during training.
dataset = per_epoch_preprocessor.transform(dataset)

trainer = TorchTrainer(
    ...,
    datasets={"train": dataset},
)
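Inside train_loop_per_worker, each worker then consumes its shard of the dataset on the fly, so the transforms above run as batches stream in. Roughly like this (a sketch using the Ray 2.7 ray.train.get_dataset_shard API; the training step itself is elided):

from ray import train

def train_loop_per_worker(config):
    # Each worker gets its own shard of the "train" dataset.
    train_shard = train.get_dataset_shard("train")
    for epoch in range(config["epochs"]):
        # Any non-materialized preprocessing runs here, as batches are read.
        for batch in train_shard.iter_torch_batches(batch_size=config["batch_size"]):
            ...  # forward/backward pass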

This user guide goes into more detail: Data Loading and Preprocessing — Ray 2.7.0
