Ray Dataset map_batches/map_groups params as part of Ray Tune hyperparams?

Hi all,

I would appreciate some advice on how to use Ray Tune and the Ray Dataset map commands together.

It is very common for last-mile dataset preprocessing to involve hyperparams that are tightly coupled with the model hyperparams.

For example, any sequence-based model depends on the window_size of the sequence when constructing the training dataset, e.g. X_train = train_dataset.groupby("USER").map_groups(transform_user_group, batch_format="pandas"). As of now I have a map_groups function transform_user_group that takes the window size, does the preprocessing, and outputs another Ray Dataset. That dataset is then passed into the per-worker training loop via the Trainer's datasets param, datasets={"train": X_train}, and later consumed inside the worker step via the AIR session object, session.get_dataset_shard("train").

However, since that all happens before the Trainer's train_loop_config comes into play, it is unclear to me how to parameterize it. If the map_groups is done inside train_loop_per_worker, I think I lose out on some of the distributed benefits?
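
For concreteness, here is roughly what the current setup looks like (just a sketch, not the exact code; the preprocessing and training-step bodies are elided, and train_dataset is assumed to already be a Ray Dataset with a "USER" column):

import functools

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def transform_user_group(group_df, window_size):
    ...  # per-USER windowing / preprocessing, returns a pandas DataFrame

# window_size is baked in here, at dataset-construction time; this is the knob I'd like Tune to own
X_train = train_dataset.groupby("USER").map_groups(
    functools.partial(transform_user_group, window_size=16),
    batch_format="pandas",
)

def train_loop_per_worker(config):
    shard = session.get_dataset_shard("train")
    ...  # per-worker training loop over the shard

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"some_other_param": ...},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": X_train},
)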

What if you create two datasets? Something like this (just to give the idea, I didn't actually try the example):

from ray import tune
from ray.train.torch import TorchTrainer
from ray.tune import Tuner

grouped = train_dataset.groupby("USER")

# one dataset per preprocessing variant, e.g. func1/func2 with different window sizes baked in
X_train_1 = grouped.map_groups(func1, batch_format="pandas")
X_train_2 = grouped.map_groups(func2, batch_format="pandas")

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={
        "some_other_param": ...,
    },
)

tuner = Tuner(
    trainer,
    param_space={"datasets": {"train": tune.choice([X_train_1, X_train_2])}},
    tune_config=tune.TuneConfig(...),
)

tuner.fit()

Oh interesting! Nice idea. So the dataset is selected through the trial's param space, and within the training worker I just pick up whichever one was chosen. Is that compatible with session.get_dataset_shard, or is there an equivalent of prepare_data_loader for existing Ray Datasets?

Yeah, it should be compatible with session.get_dataset_shard().
The same session.get_dataset_shard("train") call will give you whichever dataset Tune selected for that trial.
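
Inside the worker it looks the same as before, something like this (again just a sketch; batch_size is whatever you already pass through train_loop_config):

from ray.air import session

def train_loop_per_worker(config):
    # whichever dataset Tune picked for this trial (X_train_1 or X_train_2)
    # is what backs this shard; no prepare_data_loader equivalent is needed
    shard = session.get_dataset_shard("train")
    for batch in shard.iter_torch_batches(batch_size=config.get("batch_size", 32)):
        ...  # usual training step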

It's also possible to tune the Trainer's preprocessor param, actually.
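
E.g. something like this (sketch only; it assumes your Trainer accepts a preprocessor argument, and the scalers and the "feature" column below are just placeholders for whatever preprocessing variants you want to compare):

from ray import tune
from ray.data.preprocessors import MinMaxScaler, StandardScaler
from ray.tune import Tuner

tuner = Tuner(
    trainer,
    param_space={
        # each trial gets one of these preprocessors applied to its datasets
        "preprocessor": tune.choice(
            [StandardScaler(columns=["feature"]), MinMaxScaler(columns=["feature"])]
        ),
    },
    tune_config=tune.TuneConfig(...),
)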