Many paused jobs without progress when using TuneBOHB

I’m trying to limit the number of concurrent trials (my understanding is that "concurrent" means PAUSED + RUNNING; is this correct?). However, setting max_concurrent_trials in TuneConfig, setting max_concurrent in the search_alg, and wrapping the search_alg in a ConcurrencyLimiter all still result in Tune launching a seemingly infinite stream of PAUSED trials and only ever running a single iteration on any given trial. I have also tried setting batch=True in the ConcurrencyLimiter, to no avail. I have yet to see it run 2+ iterations on any single trial.

I’m not sure what I should do to prevent it from launching so many trials that just sit in the PAUSED state. Any help would be greatly appreciated.


The problem seems to be isolated to the HyperBandForBOHB scheduler.
I was running it like this:

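    # Import paths below are for Ray 2.x; older releases used ray.tune.suggest.
    from ray.tune.schedulers import HyperBandForBOHB
    from ray.tune.search import ConcurrencyLimiter
    from ray.tune.search.bohb import TuneBOHB

    search_alg = TuneBOHB(metric='loss', mode='min')
    search_alg = ConcurrencyLimiter(search_alg, max_concurrent=num_workers*2, batch=True)
    scheduler = HyperBandForBOHB(
        metric='loss',
        mode='min',
        max_t=args.epochs
    )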

@wyn Which Ray version are you running with?

I had the same issue. In my case, it was because max_t was given in batches (2000), which led BOHB to compute a bracket size of 841. As a result, I’d wait forever for that bracket to fill, and the trials in it stayed PAUSED in the meantime. Switching to training_iteration solved the issue.
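
To illustrate why a large max_t produces such large brackets, here is a rough sketch of the bracket-size formula from the original Hyperband paper, n = ceil((s_max + 1) * eta^s / (s + 1)). The helper name `hyperband_bracket_sizes` is made up for this example, and eta=3 is an assumed reduction factor. Ray's HyperBandForBOHB may round differently, so the numbers will not necessarily match the 841 reported above.

```python
from typing import List


def hyperband_bracket_sizes(max_t: int, eta: int = 3) -> List[int]:
    """Initial trial count per bracket, per the Hyperband paper's formula.

    Hypothetical helper for illustration only; not part of Ray's API.
    """
    # s_max is the largest s such that eta**s <= max_t.
    s_max = 0
    while eta ** (s_max + 1) <= max_t:
        s_max += 1

    sizes = []
    for s in range(s_max, -1, -1):
        numerator = (s_max + 1) * eta ** s
        # Integer ceiling division avoids floating-point rounding surprises.
        sizes.append(-(-numerator // (s + 1)))
    return sizes


for max_t in (100, 2000):
    sizes = hyperband_bracket_sizes(max_t)
    print(f"max_t={max_t}: largest bracket needs {sizes[0]} trials")
```

Under these assumptions, the most aggressive bracket grows from 81 trials at max_t=100 to 729 at max_t=2000, so expressing max_t in a coarser unit (epochs or training iterations rather than batches) keeps the brackets small enough to fill.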