Hello everyone.
I’m looking to optimize a couple of parameters of a given function using Tune.
This function uses some files and can’t be parallelized. However, I have coded the trainable in a way that allows the trials to be interleaved (using checkpoints) between file usages.
Overall, the training is working the way I expected: one trial opens a file and runs an “iteration”, then the next trial runs another “iteration” on the same file, and so on until all trials have run one “iteration” on the open file; then the second file is opened and the cycle repeats.
However, with some search algorithms the trials usually get terminated after only 1 or 2 iterations, when a full training should be about 30 iterations.
Any tips on how to get this to work? I’ll share some snippets of my code as well as my tune config below.
def trainable(config, checkpoint_dir=None):
    if checkpoint_dir:
        pass  # here I load the current open file
    # here I execute my code and produce some metrics for Ray
    with tune.checkpoint_dir(step=current_file) as checkpoint_dir:
        pass  # here I dump some variables
    yield {"metric": metric}
And here is my tune.run config:
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

bohb_hyperband = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=100,
    reduction_factor=4,
    metric="metric",
    mode="max")

bohb_search = TuneBOHB(
    max_concurrent=1,
    metric="metric",
    mode="max")

return tune.run(
    trainable,
    config=...,  # 2 uniform search spaces
    name="bohb_search",
    scheduler=bohb_hyperband,
    stop={"training_iteration": 100},
    search_alg=bohb_search,
    num_samples=10,
    resources_per_trial={"cpu": 15},  # I use this to limit trial execution to 1 at a time
)
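For completeness, the search space I pass as config is just two continuous ranges, roughly like this (the parameter names and bounds here are placeholders, not my real values):

from ray import tune

search_space = {
    "param_a": tune.uniform(0.0, 1.0),  # placeholder name and bounds
    "param_b": tune.uniform(0.0, 1.0),  # placeholder name and bounds
}

That dict is what goes into the config= argument of tune.run above.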
Note that each iteration also takes some time (20 to 40 seconds), and I’m unsure whether that is affecting the training process.
Any help is appreciated!