BOHB Scheduler Restarting Experiments without providing checkpoint_dir

Hello, I am using Ray Tune (1.2.0) with the HyperBandForBOHB scheduler. I am also using the function trainable API, since that is a requirement for compatibility with Horovod. I found that when I run more than one experiment concurrently, the experiments very frequently restart without failing. I believe this is due to the scheduler pausing experiments, but I don’t know how to confirm it, as nothing in the logs says so. In my tests I’m running two experiments total (num_samples=2), both in parallel (max_concurrent=2 inside TuneBOHB), so I don’t see why it would need to pause any of the runs. I have also been tracking the checkpoint_dir parameter of my trainable function, and it is always None. My understanding was that if the scheduler paused and unpaused an experiment, it would set this parameter to the experiment’s checkpoint folder.
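
For reference, this is roughly the shape of my setup; the metric name and max_t below are placeholders, not my real values:

from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

# Search algorithm: max_concurrent=2 so both samples run in parallel.
algo = TuneBOHB(max_concurrent=2, metric="mean_loss", mode="min")

# Scheduler: pauses/stops trials at rung boundaries based on training_iteration.
scheduler = HyperBandForBOHB(
    time_attr="training_iteration",
    metric="mean_loss",
    mode="min",
    max_t=100,
)

analysis = tune.run(
    trainer,  # my function trainable, shown later in this thread
    config={"learning_rate": tune.uniform(0.01, 0.1)},
    search_alg=algo,
    scheduler=scheduler,
    num_samples=2,
)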

Hey @philipjb, welcome! I think this is expected behavior for TuneBOHB. Can you post the output you’re finding unintuitive?

There isn’t a specific output that is wrong. I don’t understand why it needs to pause and unpause experiments when there are no extra experiments to run, but that part isn’t important. My concern is how I’m supposed to handle an experiment being unpaused when no checkpoint_dir value is passed into the trainer function. Shouldn’t checkpoint_dir be a string path instead of None if it’s resuming an experiment (I need to reload the models)? The alternative is that it’s purposefully starting new experiments, but with the exact same configs; if that’s the case, then there’s something I’m not understanding about BOHB.

Ah so basically you’re using:

  1. Horovod
  2. Ray Tune
  3. Function API

right?

Are you setting ray.tune.integration.horovod.distributed_checkpoint_dir in your code?

Yes to all three points. I’m not setting distributed_checkpoint_dir since I only checkpoint on the rank 0 device. What I mean is that I can’t use the Trainable class, which has built-in checkpointing, because I use the DistributedTrainableCreator, which requires my trainer to be a function (a rough sketch of that wrapping is after the snippet below). Right now I’m testing without Horovod to simplify things, so I’m passing the trainable function directly into tune.run, and it’s defined as:

def trainer(config, checkpoint_dir=None):
    log(checkpoint_dir)  # always logs None in my runs
    # load TF models and data
    for i in range(start, total_training_steps):
        # update model
        yield results  # report metrics back to Tune
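
For completeness, when Horovod is in the mix the wrapping looks roughly like this (the resource numbers are placeholders for my actual setup):

from ray.tune.integration.horovod import DistributedTrainableCreator

# Wrap the function trainable so each trial launches its own Horovod job;
# the resource numbers below are placeholders for my actual setup.
horovod_trainable = DistributedTrainableCreator(
    trainer,
    num_hosts=1,
    num_slots=2,
    use_gpu=True,
)
# ...and then horovod_trainable (not trainer) is what gets passed to tune.run.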

I’m logging checkpoint_dir and it’s always equal to None, even when the function gets called again after breaking out of the for/yield loop. If it’s pausing and restarting, checkpoint_dir should have a value, right? And if it were a new experiment, the config values should be different (I have learning_rate = tune.uniform(.01, .1)).

I’m not setting distributed_checkpoint_dir since I only checkpoint on the rank 0 device.

Ah, distributed_checkpoint_dir will automatically save only the checkpoint from rank 0.

If it’s pausing and restarting, checkpoint_dir should have a value, right?

Yes, if the checkpointing is implemented properly (via distributed_checkpoint_dir)

And if it were a new experiment, the config values should be different (I have learning_rate = tune.uniform(.01, .1)).

I think what is happening is that BOHB is actually just starting a new “rung”, which means it runs some of the same trials but for a longer period of time.
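
For context, the rung boundaries come from the scheduler’s max_t and reduction_factor; the numbers below are just an illustration, not your settings:

from ray.tune.schedulers import HyperBandForBOHB

# With max_t=81 and reduction_factor=3, trials get budgets of roughly
# 1, 3, 9, 27, and 81 training iterations; at each rung boundary only about
# the best third of the trials is continued (run for longer) into the next rung.
scheduler = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=81,
    reduction_factor=3,
)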

I think what is happening is that BOHB is actually just starting a new “rung”

That would make sense. So is it supposed to start from scratch or is there a way to know this is happening so I can reload and keep training from before?

So is it supposed to start from scratch or is there a way to know this is happening so I can reload and keep training from before?

If you use the distributed_checkpoint_dir, I believe Tune should be able to reload that checkpoint correctly. (Though I might be wrong… may be a good idea to test it first on some dummy model).

Thanks for your helpful comments. I’ve temporarily bypassed Horovod to work on this issue, so I pass the trainer function directly to tune.run. The function accepts the checkpoint_dir parameter like I mentioned, but maybe I should be using the tune.checkpoint_dir / distributed_checkpoint_dir call instead of relying on the parameter? I’ll test that. I might need to make a separate forum post to make sure I understand all of the BOHB parameters correctly.

Ah, I think the docs may not be clear here; you need both the parameter and the call:


def train_mnist(config, checkpoint_dir=None):
    # ... build model, optimizer, data loaders ...
    for epoch in range(40):
        train(model, optimizer, train_loader, device)
        acc = test(model, test_loader, device)

        if epoch % 3 == 0:
            # The call creates a directory that Tune tracks; save the model inside it.
            with distributed_checkpoint_dir(step=epoch) as checkpoint_dir:
                ...  # e.g. write the model weights into checkpoint_dir
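
And on the restore side, the checkpoint_dir parameter is what Tune fills in when it resumes the trial; something roughly like this at the top of the function (build_model_and_optimizer and the saved file name/tuple layout are placeholders for whatever you wrote inside the with-block above):

import os
import torch

def train_mnist(config, checkpoint_dir=None):
    model, optimizer = build_model_and_optimizer(config)  # placeholder helper
    start_epoch = 0
    if checkpoint_dir:
        # Tune passes in the directory of the checkpoint it wants to resume from;
        # "checkpoint" is just whatever file name was saved inside the with-block.
        model_state, optimizer_state, start_epoch = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
        model.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)
    ...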

Thanks. I think that without using the call, it was not creating a checkpoint that Ray was aware of, so it never passed in checkpoint_dir. That explains what is going on perfectly. It would be nice if the BOHB scheduler didn’t actually stop the experiment but just let it keep running for x more steps. To get everything working now, I save a checkpoint for myself every n steps, and I also save a “latest” checkpoint every step that overwrites itself. This lets the experiment pick up exactly where it left off without saving large numbers of checkpoints. It would be nice if the scheduler had some way of telling the trainer when it needs to save.
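
For anyone curious, here is a rough, toy-model sketch of the pattern; the pickle file name, the metric, and the config keys (total_steps, archive_every) are placeholders for my real code, and instead of literally overwriting one directory this sketch leans on keep_checkpoints_num=1 in tune.run so the per-step Tune checkpoints don’t pile up:

import os
import pickle
import random

from ray import tune


def trainer(config, checkpoint_dir=None):
    step = 0
    state = {"weights": random.random()}  # toy stand-in for my real TF models

    # Resume exactly where the trial was paused, if Tune handed us a checkpoint.
    if checkpoint_dir:
        with open(os.path.join(checkpoint_dir, "state.pkl"), "rb") as f:
            state, step = pickle.load(f)

    while step < config["total_steps"]:
        state["weights"] -= config["learning_rate"] * random.random()  # "training" step
        step += 1

        if step % config["archive_every"] == 0:
            pass  # save my own long-term checkpoint here, outside of Tune

        # Tune-visible checkpoint every step so an unpaused trial always gets a
        # checkpoint_dir to resume from.
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.pkl"), "wb") as f:
                pickle.dump((state, step), f)

        yield {"mean_loss": abs(state["weights"])}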

Note for others following: based on this thread, you can build a custom context manager that wraps the trainer’s for loop and automatically saves a checkpoint when the model is cleaned up.
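
Something along these lines; the state/step bookkeeping and the pickle format are placeholders, and I haven’t verified that the cleanup always runs when a trial is hard-killed:

import contextlib
import os
import pickle

from ray import tune


@contextlib.contextmanager
def checkpoint_on_exit(get_state):
    # Run the wrapped training loop; when the block exits (normal completion or
    # the trial being torn down), write one last Tune-visible checkpoint.
    try:
        yield
    finally:
        state, step = get_state()
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.pkl"), "wb") as f:
                pickle.dump((state, step), f)

Inside the trainable you would then wrap the training loop in with checkpoint_on_exit(lambda: (state, step)): so the final checkpoint is written even if the loop is interrupted.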