Hi all,
I would appreciate some advice on how to use Ray Tune together with Ray Dataset map commands.
It is very common for last-mile dataset preprocessing to involve a hyperparameter that is tightly coupled with the model hyperparameters.
For example, any sequence-based model needs a window_size for the sequence when constructing the training dataset, e.g. X_train = train_dataset.groupby("USER").map_groups(transform_user_group, batch_format="pandas"). As of now I have a map_groups function transform_user_group that takes the window size and does the preprocessing, outputting another Ray Dataset. That dataset is then passed into the per-worker training loop via the Trainer datasets param, datasets={"train": X_train}, and later consumed inside the worker step via the AIR session object: session.get_dataset_shard("train").
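For context, here is a minimal sketch of what such a transform_user_group might look like. This is a hypothetical stand-in (the column names "USER" and "value", and the exact windowing logic, are my assumptions, not my real code); the point is that window_size can be bound with functools.partial so the same callable works for whatever value a Tune trial picks:

```python
import functools
import pandas as pd

def transform_user_group(group: pd.DataFrame, window_size: int) -> pd.DataFrame:
    """Build sliding windows of length `window_size` over one user's rows.

    Hypothetical example: assumes each group has a "USER" column and a
    "value" column holding the per-user sequence.
    """
    values = group["value"].tolist()
    rows = [
        {"USER": group["USER"].iloc[0], "window": values[i : i + window_size]}
        for i in range(len(values) - window_size + 1)
    ]
    return pd.DataFrame(rows)

# With the hyperparameter bound up front, the map_groups call itself
# stays the same regardless of which window_size a trial chooses:
# X_train = train_dataset.groupby("USER").map_groups(
#     functools.partial(transform_user_group, window_size=ws),
#     batch_format="pandas",
# )
```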
However, since this preprocessing happens before the Trainer's train_loop_config comes into play, it is unclear to me how to parameterize it per trial. If the map_groups call is done inside train_loop_per_worker, I think I lose out on some of the distributed benefits?