Hi all,
I would appreciate some advice on how to use Ray Tune together with Ray Dataset map commands.
It is very common for last-mile dataset preprocessing to depend on a hyperparameter that is tightly coupled with the model hyperparameters.
A typical example is any sequence-based model that looks at a `window_size` of the sequence when constructing the training dataset, e.g. `X_train = train_dataset.groupby("USER").map_groups(transform_user_group, batch_format="pandas")`. As of now I have a `map_groups` function `transform_user_group` that takes the window size, does the preprocessing, and outputs another Ray Dataset. That dataset gets passed into the per-worker training loop via the Trainer's datasets param, `datasets={"train": X_train}`, and is later consumed inside the worker loop via the AIR session object with `session.get_dataset_shard("train")`.
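For concreteness, here is a trimmed-down sketch of the current flow. The Parquet path, the `ScalingConfig`, and the body of `transform_user_group` are placeholders; the point is that the `window_size` is baked in before the Trainer or any Tune config exists:

```python
import ray
import pandas as pd
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def transform_user_group(group: pd.DataFrame, window_size: int = 32) -> pd.DataFrame:
    # Placeholder preprocessing: keep the last `window_size` rows per user.
    # The real function builds sliding windows and feature columns.
    return group.tail(window_size)


# Preprocessing runs here, before the Trainer (and its train_loop_config) exists,
# so window_size is effectively hard-coded at this point.
train_dataset = ray.data.read_parquet("s3://bucket/events")  # placeholder source
X_train = train_dataset.groupby("USER").map_groups(
    transform_user_group, batch_format="pandas"
)


def train_loop_per_worker(config):
    # Each worker consumes its shard of the already-preprocessed dataset.
    shard = session.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        for batch in shard.iter_batches(batch_size=config["batch_size"]):
            ...  # model forward/backward step goes here


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 1, "batch_size": 128},
    datasets={"train": X_train},
    scaling_config=ScalingConfig(num_workers=2),
)
```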
However, since that preprocessing happens before the Trainer's `train_loop_config` is ever applied, it is unclear to me how to parameterize it from Tune. And if the `map_groups` call is instead done inside `train_loop_per_worker`, I think I lose out on some of the distributed benefits?
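What I would like, conceptually, is for `window_size` to live in the Tune search space next to the model hyperparameters, roughly like the sketch below (using the `trainer` from the snippet above). The open question is where the `groupby`/`map_groups` call should go so that each trial actually uses the sampled value:

```python
from ray import tune
from ray.tune import Tuner

# Sketch only: window_size is sampled per trial, but today nothing in my
# pipeline would re-run the map_groups preprocessing with this value.
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "batch_size": tune.choice([64, 128]),
            "window_size": tune.choice([16, 32, 64]),
        }
    },
)
results = tuner.fit()
```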