Ray.tune - Best practices for reading datasets

Background:

  • I have a dataset containing eight weeks of data. It is O(10 GB) in size.
  • I want to train and validate on sliding windows:
    • Train on (W1, W2), Test on (W3, W4)
    • Train on (W2, W3), Test on (W4, W5)
    • etc etc
  • For each split, I want to perform a grid search over O(100) hyperparameter combinations
  • Optimal hyperparameters are chosen as those that give the best mean test score over all windows

I am assuming that ray.tune will be important here, and that I’ll do something like:

from ray import tune

analysis = tune.run(
    train_one_model,
    verbose=False,
    config={
        "max_depth": tune.grid_search([2, 4, 8]),
        "min_child_weight": tune.grid_search([0, 0.01, 0.1])
    })
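
To fold the sliding windows into the same run, one option I’m considering is to make the window index just another grid axis and average the test scores per hyperparameter combination afterwards. A rough sketch, where window_start, test_score and the exact result-column names are my own guesses rather than anything Tune prescribes:

from ray import tune

analysis = tune.run(
    train_one_model,
    verbose=False,
    config={
        # Train on (w, w+1), test on (w+2, w+3); with eight weeks w runs 1..5.
        "window_start": tune.grid_search([1, 2, 3, 4, 5]),
        "max_depth": tune.grid_search([2, 4, 8]),
        "min_child_weight": tune.grid_search([0, 0.01, 0.1]),
    })

# Assuming each trial calls tune.report(test_score=...), average over windows.
# (Exact config column names in the dataframe may vary by Ray version.)
df = analysis.dataframe()
mean_scores = (
    df.groupby(["config/max_depth", "config/min_child_weight"])["test_score"]
    .mean()
    .sort_values(ascending=False)
)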

My question:

How and when should data be read?

Some options:

  1. As a first stab, I could read the dataset inside the body of train_one_model:
def train_one_model(config):
    df = load_data(directory)  # re-reads the full dataset from disk on every trial
    train = df[df["week_index"].isin((1, 2))]
    test = df[df["week_index"].isin((3, 4))]

This seems to be how many of the examples do it. However, it feels like a lot of unnecessary IO. I could save a bit of work by splitting the data on disk, but assuming I have O(10) workers and O(100) parameterisations, each worker will still end up reading the same thing over and over.
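
For concreteness, the “split it on disk” variant I have in mind would look roughly like this. The per-week parquet layout, directory, window_start and fit_and_score are all my own placeholders:

import pandas as pd
from ray import tune

# Sketch: one parquet file per week, so each trial only reads the four weeks
# it actually needs instead of the full dataset.
def train_one_model(config):
    w = config["window_start"]
    df = pd.concat(
        pd.read_parquet(f"{directory}/week={i}.parquet") for i in range(w, w + 4)
    )
    train = df[df["week_index"].isin((w, w + 1))]
    test = df[df["week_index"].isin((w + 2, w + 3))]
    score = fit_and_score(train, test, config)  # hypothetical training helper
    tune.report(test_score=score)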

  2. I could try tune.with_parameters. The example in the docs looks roughly like what I want:
from ray import tune

def train(config, data=None):
    for sample in data:
        loss = update_model(sample)
        tune.report(loss=loss)

data = HugeDataset(download=True)

tune.run(
    tune.with_parameters(train, data=data),
    # ...
)

By analogy with Spark, this feels weird - you wouldn’t typically read several gigs onto the driver and broadcast it out to your executors. But Ray isn’t Spark, and I don’t have a good mental model of Ray. Perhaps in the context of Ray this pattern makes sense?
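
If I went down this route, I think my version would look something like the following. This is only a sketch; as I understand it, with_parameters puts the dataframe into the Ray object store once and each trial receives a reference to it rather than a fresh copy from the driver each time. window_start, load_data and fit_and_score are placeholders from my earlier snippets:

from ray import tune

def train_one_model(config, df=None):
    w = config["window_start"]
    train = df[df["week_index"].isin((w, w + 1))]
    test = df[df["week_index"].isin((w + 2, w + 3))]
    score = fit_and_score(train, test, config)  # hypothetical helper
    tune.report(test_score=score)

df = load_data(directory)  # read once, on the driver

analysis = tune.run(
    tune.with_parameters(train_one_model, df=df),
    config={
        "window_start": tune.grid_search([1, 2, 3, 4, 5]),
        "max_depth": tune.grid_search([2, 4, 8]),
        "min_child_weight": tune.grid_search([0, 0.01, 0.1]),
    })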

  3. Something else. Is there a way to configure things so that the first time train_one_model is called on a worker, it reads the data into that worker’s memory, and on subsequent runs the data is already sitting there?
df = None  # module-level cache; survives across trials only if the worker process is reused

def train_one_model(config):
    global df
    if df is None:
        df = load_data(directory)  # only hit disk the first time this process runs a trial
    train = df[df["week_index"].isin((1, 2))]
    test = df[df["week_index"].isin((3, 4))]

Can this be achieved via ray.put and ray.get? Would this be fundamentally different / better / worse than option 2?
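
For concreteness, the ray.put / ray.get version I am picturing is roughly this (a sketch; the trainable just captures the small ObjectRef in its closure, which I believe is more or less what with_parameters does under the hood):

import ray
from ray import tune

ray.init()

# Put the dataframe into the shared object store once; only the ObjectRef
# is captured by the trainable's closure.
df_ref = ray.put(load_data(directory))

def train_one_model(config):
    df = ray.get(df_ref)  # fetched from the object store on whichever node runs the trial
    train = df[df["week_index"].isin((1, 2))]
    test = df[df["week_index"].isin((3, 4))]
    score = fit_and_score(train, test, config)  # hypothetical helper, as above
    tune.report(test_score=score)

# tune.run(train_one_model, config=...) as before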

Via @Yard1 on Slack: One way you could achieve 3 would be to use the class Trainable API with the reuse_actors=True argument to tune.run. You would do the data loading and splitting inside Trainable.setup, which happens only once per worker. See more here - Training (tune.Trainable, tune.report) — Ray v1.10.0. Another way would be to use Ray Datasets to do the distributed splitting before starting the run and then pass the Dataset objects to the Trainables through tune.with_parameters.
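
A minimal sketch of the first suggestion follows. The class name, window handling and fit_and_score are placeholders; the important pieces are loading in setup, implementing reset_config so the actor can be reused, and passing reuse_actors=True:

from ray import tune

class TrainOneModel(tune.Trainable):
    def setup(self, config):
        # Runs once when the actor is created; the dataframe then stays
        # resident in the actor's memory across trials.
        self.df = load_data(directory)

    def reset_config(self, new_config):
        # Returning True lets Tune reuse this actor for the next trial.
        self.config = new_config
        return True

    def step(self):
        w = self.config["window_start"]
        train = self.df[self.df["week_index"].isin((w, w + 1))]
        test = self.df[self.df["week_index"].isin((w + 2, w + 3))]
        score = fit_and_score(train, test, self.config)  # hypothetical helper
        return {"test_score": score, "done": True}

analysis = tune.run(
    TrainOneModel,
    reuse_actors=True,
    config={
        "window_start": tune.grid_search([1, 2, 3, 4, 5]),
        "max_depth": tune.grid_search([2, 4, 8]),
        "min_child_weight": tune.grid_search([0, 0.01, 0.1]),
    })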
