BOHB Scheduler Restarting Experiments without providing checkpoint_dir

Hello, I am using Ray Tune (1.2.0) with the HyperBandForBOHB scheduler. I am also using the function trainable API, since that is a requirement for compatibility with Horovod. I found that when I run more than one experiment concurrently, the experiments very frequently restart without failing. I believe this is due to the scheduler pausing experiments, but I don’t know how to confirm it, as nothing in the logs says so. In my tests I’m running two experiments total (num_samples=2), both in parallel (max_concurrent=2 inside TuneBOHB), so I don’t see why it would need to pause any of the runs. I have also been tracking the checkpoint_dir parameter of my trainable function, and it is always None. My understanding was that if the scheduler paused and unpaused an experiment, it would set this parameter to the experiment’s checkpoint folder.
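
For reference, this is roughly the shape of my setup; the metric name and max_t below are placeholders, not my real values:

from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

# Search algorithm: max_concurrent=2 so both samples run in parallel.
algo = TuneBOHB(max_concurrent=2, metric="mean_loss", mode="min")

# Scheduler: pauses/stops trials at rung boundaries based on training_iteration.
scheduler = HyperBandForBOHB(
    time_attr="training_iteration",
    metric="mean_loss",
    mode="min",
    max_t=100,
)

analysis = tune.run(
    trainer,  # my function trainable, shown later in this thread
    config={"learning_rate": tune.uniform(0.01, 0.1)},
    search_alg=algo,
    scheduler=scheduler,
    num_samples=2,
)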

Hey @philipjb, welcome! I think this is expected behavior for TuneBOHB. Can you post the output you’re finding unintuitive?

There isn’t a specific output that is wrong. I don’t understand why it needs to pause and unpause experiments when there are no extra experiments to run, but that part isn’t important. My concern is how I’m supposed to handle an experiment being unpaused when no checkpoint_dir value is passed into the trainer function. Shouldn’t checkpoint_dir be a string path instead of None if it’s resuming an experiment (I need to reload the models)? The alternative is that it’s purposefully starting new experiments, but with the exact same configs; if that’s the case, then there’s something I’m not understanding about BOHB.

Ah so basically you’re using:

  1. Horovod
  2. Ray Tune
  3. Function API

right?

Are you setting ray.tune.integration.horovod.distributed_checkpoint_dir in your code?

Yes to all three points. I’m not setting distributed_checkpoint_dir since I only checkpoint on the rank 0 device. What I mean is that I can’t use the Trainable class, which has built-in checkpointing, because I use the DistributedTrainableCreator, which requires my trainer to be a function (a rough sketch of that wrapping is after the snippet below). Right now I’m testing without Horovod to simplify things, so I’m passing the trainable function directly into tune.run, and it’s defined as:

def trainer(config, checkpoint_dir=None):
    log(checkpoint_dir)  # always logs None in my runs
    # load TF models and data
    for i in range(start, total_training_steps):
        # update model
        yield results  # report metrics back to Tune
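
For completeness, when Horovod is in the mix the wrapping looks roughly like this (the resource numbers are placeholders for my actual setup):

from ray.tune.integration.horovod import DistributedTrainableCreator

# Wrap the function trainable so each trial launches its own Horovod job;
# the resource numbers below are placeholders for my actual setup.
horovod_trainable = DistributedTrainableCreator(
    trainer,
    num_hosts=1,
    num_slots=2,
    use_gpu=True,
)
# ...and then horovod_trainable (not trainer) is what gets passed to tune.run.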

I’m logging checkpoint_dir and it’s always equal to None, even when the function gets called again after breaking out of the for/yield loop. If it’s pausing and restarting, checkpoint_dir should have a value, right? And if it were a new experiment, the config values should be different (I have learning_rate = tune.uniform(.01, .1)).

I’m not setting distributed_checkpoint_dir since I only checkpoint on the rank 0 device.

Ah, distributed_checkpoint_dir will automatically save only the checkpoint from rank 0.

If it’s pausing and restarting, checkpoint_dir should have a value, right?

Yes, if the checkpointing is implemented properly (via distributed_checkpoint_dir)

And if it were a new experiment, the config values should be different (I have learning_rate = tune.uniform(.01, .1)).

I think what is happening is that BOHB is actually just starting a new “rung”, which means it runs some of the same trials but for a longer period of time.
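
For context, the rung boundaries come from the scheduler’s max_t and reduction_factor; the numbers below are just an illustration, not your settings:

from ray.tune.schedulers import HyperBandForBOHB

# With max_t=81 and reduction_factor=3, trials get budgets of roughly
# 1, 3, 9, 27, and 81 training iterations; at each rung boundary only about
# the best third of the trials is continued (run for longer) into the next rung.
scheduler = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=81,
    reduction_factor=3,
)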

I think what is happening is that BOHB is actually just starting a new “rung”

That would make sense. So is it supposed to start from scratch or is there a way to know this is happening so I can reload and keep training from before?

So is it supposed to start from scratch or is there a way to know this is happening so I can reload and keep training from before?

If you use the distributed_checkpoint_dir, I believe Tune should be able to reload that checkpoint correctly. (Though I might be wrong… may be a good idea to test it first on some dummy model).

Thanks for your helpful comments. I’ve temporarily bypassed Horovod to work on this issue, so I pass the trainer function directly to tune.run. The function accepts the checkpoint_dir parameter like I mentioned, but maybe I should be using the tune.checkpoint_dir / distributed_checkpoint_dir call instead of relying on the parameter? I’ll test that. I might need to make a separate forum post to make sure I understand all of the BOHB parameters correctly.

Ah, I think the docs may not be clear here; you need both the parameter and the call:


def train_mnist(config, checkpoint_dir=None):
    # ... build model, optimizer, data loaders ...
    for epoch in range(40):
        train(model, optimizer, train_loader, device)
        acc = test(model, test_loader, device)

        if epoch % 3 == 0:
            # The call creates a directory that Tune tracks; save the model inside it.
            with distributed_checkpoint_dir(step=epoch) as checkpoint_dir:
                ...  # e.g. write the model weights into checkpoint_dir
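
And on the restore side, the checkpoint_dir parameter is what Tune fills in when it resumes the trial; something roughly like this at the top of the function (build_model_and_optimizer and the saved file name/tuple layout are placeholders for whatever you wrote inside the with-block above):

import os
import torch

def train_mnist(config, checkpoint_dir=None):
    model, optimizer = build_model_and_optimizer(config)  # placeholder helper
    start_epoch = 0
    if checkpoint_dir:
        # Tune passes in the directory of the checkpoint it wants to resume from;
        # "checkpoint" is just whatever file name was saved inside the with-block.
        model_state, optimizer_state, start_epoch = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
        model.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)
    ...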

Thanks. I think that without using the call, it was not creating a checkpoint that Ray was aware of, so it never passed in checkpoint_dir. That explains what is going on perfectly. It would be nice if the BOHB scheduler didn’t actually stop the experiment but just let it keep running for x more steps. To get everything working now, I save a checkpoint for myself every n steps, and I also save a “latest” checkpoint every step that overwrites itself. This lets the experiment pick up exactly where it left off without saving large numbers of checkpoints. It would be nice if the scheduler had some way of telling the trainer when it needs to save.
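
For anyone curious, here is a rough, toy-model sketch of the pattern; the pickle file name, the metric, and the config keys (total_steps, archive_every) are placeholders for my real code, and instead of literally overwriting one directory this sketch leans on keep_checkpoints_num=1 in tune.run so the per-step Tune checkpoints don’t pile up:

import os
import pickle
import random

from ray import tune


def trainer(config, checkpoint_dir=None):
    step = 0
    state = {"weights": random.random()}  # toy stand-in for my real TF models

    # Resume exactly where the trial was paused, if Tune handed us a checkpoint.
    if checkpoint_dir:
        with open(os.path.join(checkpoint_dir, "state.pkl"), "rb") as f:
            state, step = pickle.load(f)

    while step < config["total_steps"]:
        state["weights"] -= config["learning_rate"] * random.random()  # "training" step
        step += 1

        if step % config["archive_every"] == 0:
            pass  # save my own long-term checkpoint here, outside of Tune

        # Tune-visible checkpoint every step so an unpaused trial always gets a
        # checkpoint_dir to resume from.
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.pkl"), "wb") as f:
                pickle.dump((state, step), f)

        yield {"mean_loss": abs(state["weights"])}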

Note for others following: based on this thread, you can build a custom context manager that wraps the trainer’s for loop and automatically saves a checkpoint when the model is cleaned up.
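
Something along these lines; the state/step bookkeeping and the pickle format are placeholders, and I haven’t verified that the cleanup always runs when a trial is hard-killed:

import contextlib
import os
import pickle

from ray import tune


@contextlib.contextmanager
def checkpoint_on_exit(get_state):
    # Run the wrapped training loop; when the block exits (normal completion or
    # the trial being torn down), write one last Tune-visible checkpoint.
    try:
        yield
    finally:
        state, step = get_state()
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.pkl"), "wb") as f:
                pickle.dump((state, step), f)

Inside the trainable you would then wrap the training loop in with checkpoint_on_exit(lambda: (state, step)): so the final checkpoint is written even if the loop is interrupted.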