Background:
- I have a dataset containing eight weeks of data. It is O(10Gb) in size.
- I want to train and validate on sliding windows:
- Train on (W1, W2), Test on (W3, W4)
- Train on (W2, W3), Test on (W4, W5)
- etc etc
- For each split, I want to perform grid search over O(100) hyperparameters
- Optimal hyperparameters are chosen as those that give the best mean test score over all windows
I am assuming that ray.tune will be important here, so that I’ll do something like:
analysis = tune.run(
train_one_model,
verbose=False,
config={
"max_depth": tune.grid_search([2, 4, 8]),
"min_child_weight": tune.grid_search([0, 0.01, 0.1])
})
My question:
How and when should data be read?
Some options:
- As a first stab, I could read the dataset in inside the body of
train_one_model
:
def train_one_model(configs):
df = load_data(directory)
train = df[df["week_index"].isin((1, 2))]
test = df[df["week_index"].isin((3, 4))]
This seems to be how many of the examples do it. However, this feels like a lots of unnecessary IO. I could save a bit of work by splitting the data on disc. But assuming I have O(10) workers and O(100) parameterisations, then each worker will still end up reading the same thing over and over.
- I could try tune.with_parameters . The example does vaguely look like what I want:
from ray import tune
def train(config, data=None):
for sample in data:
loss = update_model(sample)
tune.report(loss=loss)
data = HugeDataset(download=True)
tune.run(
tune.with_parameters(train, data=data),
# ...
)
By analogy with Spark, this feels weird - you wouldn’t typically read several gigs onto the driver and broadcast it out to your executors. But Ray isn’t Spark, and I don’t have a good mental model of Ray. Perhaps in the context of Ray this pattern makes sense?
- Something else. Is there a way to configure this so that the first time
train_one_model
gets called on a worker, it reads the data into that worker’s memory. Then in subsequent runs the data is already sitting there?
def train_one_model(configs):
if not df:
df = load_data(directory)
train = df[df["week_index"].isin((1, 2))]
test = df[df["week_index"].isin((3, 4))]
Can this be achieved via ray.put
and ray.get
? Would this be fundamentally different / better / worse than option 2?