Ray Dataset map_batches/map_groups params as part of Ray Tune hyperparams?

Hi all,

I would appreciate some advice on how to use Ray Tune and the Ray Dataset map commands together.

It is very common for last-mile dataset preprocessing to involve hyperparams that are tightly coupled with the model hyperparams.

For example, any sequence-based model depends on the window_size of the sequence when constructing the training dataset, e.g. X_train = train_dataset.groupby("USER").map_groups(transform_user_group, batch_format="pandas"). As of now I have a map_groups function transform_user_group that takes the window size, does the preprocessing, and outputs another Ray Dataset. That dataset is then passed into the per-worker training loop via the Trainer's datasets param, datasets={"train": X_train}, and later consumed inside the worker step via the AIR session object, session.get_dataset_shard("train").

However, since that all happens before the Trainer's train_loop_config comes into play, it is unclear to me how to parameterize it. If the map_groups is done inside train_loop_per_worker, I think I lose out on some of the distributed benefits?
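
For concreteness, here is roughly what the current setup looks like (just a sketch, not the exact code; the preprocessing and training-step bodies are elided, and train_dataset is assumed to already be a Ray Dataset with a "USER" column):

import functools

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def transform_user_group(group_df, window_size):
    ...  # per-USER windowing / preprocessing, returns a pandas DataFrame

# window_size is baked in here, at dataset-construction time; this is the knob I'd like Tune to own
X_train = train_dataset.groupby("USER").map_groups(
    functools.partial(transform_user_group, window_size=16),
    batch_format="pandas",
)

def train_loop_per_worker(config):
    shard = session.get_dataset_shard("train")
    ...  # per-worker training loop over the shard

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"some_other_param": ...},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": X_train},
)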

What if you create two datasets? Something like this (just to give the idea, I didn't actually try the example):

from ray import tune
from ray.train.torch import TorchTrainer
from ray.tune import Tuner

grouped = train_dataset.groupby("USER")

# one dataset per preprocessing variant, e.g. func1/func2 with different window sizes baked in
X_train_1 = grouped.map_groups(func1, batch_format="pandas")
X_train_2 = grouped.map_groups(func2, batch_format="pandas")

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={
        "some_other_param": ...,
    },
)

tuner = Tuner(
    trainer,
    param_space={"datasets": {"train": tune.choice([X_train_1, X_train_2])}},
    tune_config=tune.TuneConfig(...),
)

tuner.fit()

Oh interesting! Nice idea. So the dataset is selected through the trial's param space, and within the training worker I just pick up whichever one was chosen. Is that compatible with session.get_dataset_shard, or is there an equivalent of prepare_data_loader for existing Ray Datasets?

Yeah, it should be compatible with session.get_dataset_shard().
The same session.get_dataset_shard("train") call will give you whichever dataset Tune selected for that trial.
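
Inside the worker it looks the same as before, something like this (again just a sketch; batch_size is whatever you already pass through train_loop_config):

from ray.air import session

def train_loop_per_worker(config):
    # whichever dataset Tune picked for this trial (X_train_1 or X_train_2)
    # is what backs this shard; no prepare_data_loader equivalent is needed
    shard = session.get_dataset_shard("train")
    for batch in shard.iter_torch_batches(batch_size=config.get("batch_size", 32)):
        ...  # usual training step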

It's also possible to tune the Trainer's preprocessor param, actually.
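
E.g. something like this (sketch only; it assumes your Trainer accepts a preprocessor argument, and the scalers and the "feature" column below are just placeholders for whatever preprocessing variants you want to compare):

from ray import tune
from ray.data.preprocessors import MinMaxScaler, StandardScaler
from ray.tune import Tuner

tuner = Tuner(
    trainer,
    param_space={
        # each trial gets one of these preprocessors applied to its datasets
        "preprocessor": tune.choice(
            [StandardScaler(columns=["feature"]), MinMaxScaler(columns=["feature"])]
        ),
    },
    tune_config=tune.TuneConfig(...),
)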