Is this sample code correct?

In the doc Computer Vision — Ray 2.6.3, a CV sample is given to show how AIR works for computer vision. Here is a code snippet:


dataset = per_epoch_preprocessor.transform(dataset)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32, "lr": 0.02, "epochs": 1},
    datasets={"train": dataset},
    scaling_config=ScalingConfig(num_workers=2),
    preprocessor=preprocessor,
)
results = trainer.fit()


The two preprocessors seem to be swapped, and I think the correct version should look like this:


dataset = preprocessor.transform(dataset)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32, "lr": 0.02, "epochs": 1},
    datasets={"train": dataset},
    scaling_config=ScalingConfig(num_workers=2),
    preprocessor=per_epoch_preprocessor,
)
results = trainer.fit()

Let me know if I have this right, and please feel free to correct me if I am wrong.

Hi @Li_Bin,

The integration of dataset preprocessing with Ray Data as the data ingest for Ray Train has been reworked in Ray 2.7. This user guide is now using outdated APIs, as you're no longer able to pass a preprocessor into a Trainer. Here's a quick rundown of how you should do global + per-epoch preprocessing:

preprocessor in this example is meant to be the “global” preprocessor that gets applied before training.

per_epoch_preprocessor is meant to be applied on the fly, as we iterate through the dataset, and may contain some random operations.
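For concreteness, the two preprocessors could be built something like this (just a sketch, assuming torchvision transforms and ray.data.preprocessors.TorchVisionPreprocessor; your columns and transforms will differ):

from torchvision import transforms
from ray.data.preprocessors import TorchVisionPreprocessor

# Deterministic transforms -> the "global" preprocessor
global_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224), antialias=True),
])
preprocessor = TorchVisionPreprocessor(columns=["image"], transform=global_transform)

# Random augmentations -> the per-epoch preprocessor
random_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
])
per_epoch_preprocessor = TorchVisionPreprocessor(columns=["image"], transform=random_transform)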

Because Ray Data now executes the stages in a streaming fashion, both these preprocessors get applied on the fly, as data gets read by your Ray Train script. So there is not really a distinction between the “global” and per-epoch preprocessor anymore, unless you explicitly materialize the dataset at some stage.

dataset = ...

global_preprocessor = TorchVisionPreprocessor(...)
dataset = global_preprocessor.transform(dataset)

# Optional: if you want to cache the outputs after global preprocessing in object store memory
dataset = dataset.materialize()

# If we materialized at the previous step, only the per_epoch_preprocessor
# logic will get run on the fly during training.
dataset = per_epoch_preprocessor.transform(dataset)

trainer = TorchTrainer(
    ...,
    datasets={"train": dataset},
)
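Inside train_loop_per_worker, each worker then consumes its shard of the dataset on the fly, so the transforms above run as batches stream in. Roughly like this (a sketch using the Ray 2.7 ray.train.get_dataset_shard API; the training step itself is elided):

from ray import train

def train_loop_per_worker(config):
    # Each worker gets its own shard of the "train" dataset.
    train_shard = train.get_dataset_shard("train")
    for epoch in range(config["epochs"]):
        # Any non-materialized preprocessing runs here, as batches are read.
        for batch in train_shard.iter_torch_batches(batch_size=config["batch_size"]):
            ...  # forward/backward pass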

This user guide goes into more detail: Data Loading and Preprocessing — Ray 2.7.0
