Running Tune with a non-parallel function

Hello everyone.

I’m looking to optimize a couple of parameters of a given function using Tune.

This function uses some files and can’t be parallelized. However, I have coded the trainable in a way that allows the trials to be interleaved (using checkpoints) between file usages.

Overall, the training is working the way I thought it would (one trial opens a file and runs an “iteration”, the next trial runs another “iteration” on the same file, and so on until all trials have run one “iteration” on the open file; then the 2nd file is opened and the cycle repeats).

However, when using some search algorithms, the trials usually get terminated after 1 or 2 iterations, when a full training should run for about 30 iterations.

Any tips on how to get this to work? I’ll share some snippets of my code as well as my Tune config.

def trainable(config, checkpoint_dir=None):
    if checkpoint_dir:
        # here I load the current open file
        ...

    # here I execute my code and produce some metrics for Ray

    with tune.checkpoint_dir(step=current_file) as checkpoint_dir:
        # here I dump some variables
        ...

    yield {"metric": metric}

And here is my tune.run config:

bohb_hyperband = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=100,
    reduction_factor=4,
    metric="metric",
    mode="max")
bohb_search = TuneBOHB(
    max_concurrent=1,
    metric="metric",
    mode="max")
return tune.run(trainable,
                config=...,  # 2 uniform search spaces
                name="bohb_search",
                scheduler=bohb_hyperband,
                stop={"training_iteration": 100},
                search_alg=bohb_search,
                num_samples=10,
                resources_per_trial={"cpu": 15} # I use this to limit trial execution to 1 at a time
                )

Note that each iteration also takes some time (20 to 40 seconds), and I’m unsure if that is affecting the training process.

Any help is appreciated!

Can you share more of your training code? If you have a yield, you probably want to put that in a for loop.
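
For example, here’s a rough sketch of what I mean with the function API you’re already using (assuming 30 iterations total; run_one_iteration and the state.json file are just placeholders for your own per-file logic):

import json
import os

from ray import tune

def trainable(config, checkpoint_dir=None):
    start = 0
    if checkpoint_dir:
        # restore which iteration/file we were on from the last checkpoint
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            start = json.load(f)["step"] + 1

    for step in range(start, 30):
        # placeholder for running one "iteration" on the currently open file
        metric = run_one_iteration(config, step)

        with tune.checkpoint_dir(step=step) as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.json"), "w") as f:
                json.dump({"step": step}, f)

        # yielding inside the loop reports one result per iteration,
        # so the scheduler sees training_iteration advance past 1 or 2
        yield {"metric": metric}

Without the loop, the function body runs once, yields a single result, and the trial is treated as finished, which would explain trials ending after one iteration.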

Hey, I might not need to, as I do not have any for loops in my code.

I ran my code on the assumption that each iteration ran up until the checkpoint and then paused, resuming from the beginning of the function on the next iteration, but that seems not to be the case.

Is there any way for me to guarantee that each “step” of the trainable function is run sequentially (as in, all trials run step 1, one by one, then all trials run step 2, one by one, etc.)? I thought about using local mode, but I am using an actor for global variable storage, which makes this impossible.

Hmm, I don’t think that’s possible, unfortunately. Can you tell me exactly what you’re trying to do re: the writing?