Ray.tune - Best practices for reading datasets

dfernig · February 18, 2022, 6:47am

Background:

I have a dataset containing eight weeks of data. It is O(10 GB) in size.
I want to train and validate on sliding windows:
- Train on (W1, W2), Test on (W3, W4)
- Train on (W2, W3), Test on (W4, W5)
- and so on

For each split, I want to perform a grid search over O(100) hyperparameter combinations.
The optimal hyperparameters are those that give the best mean test score over all windows.
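
To make the setup above concrete, here is a minimal plain-Python sketch of the window generation and the "best mean test score" selection. The helper names (sliding_windows, grid, best_by_mean_score) are hypothetical and not part of Ray, and the sketch assumes a higher score is better.

```python
from itertools import product
from statistics import mean


def sliding_windows(weeks, train_size=2, test_size=2):
    """Yield (train_weeks, test_weeks) tuples, advancing one week at a time."""
    span = train_size + test_size
    for start in range(len(weeks) - span + 1):
        train = tuple(weeks[start:start + train_size])
        test = tuple(weeks[start + train_size:start + span])
        yield train, test


def grid(space):
    """Expand {"name": [values, ...]} into a list of config dicts (full grid)."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]


def best_by_mean_score(results):
    """Pick the config with the highest mean score.

    results is an iterable of (config, score) pairs, one pair per
    (config, window) combination. Assumes higher scores are better.
    """
    by_config = {}
    for config, score in results:
        key = tuple(sorted(config.items()))  # hashable stand-in for the dict
        by_config.setdefault(key, []).append(score)
    best_key = max(by_config, key=lambda k: mean(by_config[k]))
    return dict(best_key), mean(by_config[best_key])
```

With weeks 1 to 8 and two-week train and test blocks, list(sliding_windows(list(range(1, 9)))) gives five windows, starting with ((1, 2), (3, 4)) and ((2, 3), (4, 5)).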

I am assuming that ray.tune will be important here, so that I’ll do something like:

analysis = tune.run(
    train_one_model,
    verbose=False,
    config={
        "max_depth": tune.grid_search([2, 4, 8]),
        "min_child_weight": tune.grid_search([0, 0.01, 0.1])
    })

My question:

How and when should data be read?

Some options:

Option 1: As a first stab, I could read the dataset inside the body of train_one_model:

def train_one_model(configs):
    df = load_data(directory)
    train = df[df["week_index"].isin((1, 2))]
    test = df[df["week_index"].isin((3, 4))]

This seems to be how many of the examples do it. However, this feels like a lot of unnecessary I/O. I could save a bit of work by splitting the data on disk, but with O(10) workers and O(100) parameterisations, each worker would still end up reading the same thing over and over.

Option 2: I could try tune.with_parameters. The example does look vaguely like what I want:

from ray import tune

def train(config, data=None):
    for sample in data:
        loss = update_model(sample)
        tune.report(loss=loss)

data = HugeDataset(download=True)

tune.run(
    tune.with_parameters(train, data=data),
    # ...
)

By analogy with Spark, this feels weird - you wouldn’t typically read several gigs onto the driver and broadcast it out to your executors. But Ray isn’t Spark, and I don’t have a good mental model of Ray. Perhaps in the context of Ray this pattern makes sense?

Option 3: Something else. Is there a way to configure this so that the first time train_one_model is called on a worker, it reads the data into that worker's memory, and on subsequent runs the data is already sitting there?

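_df = None  # intended per-worker cache; empty until the first call

def train_one_model(configs):
    global _df
    if _df is None:  # "if not df" would fail: df is unbound, and DataFrame truthiness is ambiguous
        _df = load_data(directory)
    train = _df[_df["week_index"].isin((1, 2))]
    test = _df[_df["week_index"].isin((3, 4))]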

Can this be achieved via ray.put and ray.get? Would this be fundamentally different / better / worse than option 2?

dfernig · February 18, 2022, 6:48am

Via @Yard1 on Slack: One way to achieve option 3 is to use the class-based Trainable API together with the reuse_actors=True argument to tune.run. You would do the data loading and splitting inside Trainable.setup, which would then happen only once per worker. See more here: Training (tune.Trainable, tune.report) — Ray v1.10.0. Another way might be to use Ray Datasets to do the distributed splitting before starting the run, and then pass the Dataset objects to the Trainables through tune.with_parameters.
