Extremely slow BO after random sampling ends

During the initial random sampling phase I almost always have 100% GPU usage with 40 parallel processes. Once that phase ends and the actual BO starts, I get almost no GPU usage, which seems implausible with so many processes. Ideally, usage would stay at 100% the whole time to avoid wasting cluster time.

I am tuning a Keras NN model using the TuneReportCallback object.

How may I proceed?


For the time being I am fixing this by using:

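import os
import sys

# Raise the minimum interval between global experiment checkpoints to an
# effectively infinite number of seconds. setdefault() keeps any value that
# is already set in the environment.
os.environ.setdefault("TUNE_GLOBAL_CHECKPOINT_S", str(sys.maxsize))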

I’m not sure it’s what I want, but it seems to do the trick.
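
For anyone trying the same workaround, here is a slightly more explicit sketch of it. The helper name `set_global_checkpoint_period` and its `override` flag are made up for illustration and are not part of Ray Tune; only the `TUNE_GLOBAL_CHECKPOINT_S` variable comes from the snippet above. I am assuming it is safest to set the variable before the Tune run is started, since I have not checked exactly when Tune reads it.

```python
import os
import sys

ENV_VAR = "TUNE_GLOBAL_CHECKPOINT_S"


def set_global_checkpoint_period(seconds, override=False):
    """Set the global checkpoint period (hypothetical helper, not a Ray API).

    With override=False this behaves like os.environ.setdefault(): a value
    already exported in the shell or job script is left untouched.
    Returns the value that is in effect afterwards.
    """
    value = str(int(seconds))
    if override:
        os.environ[ENV_VAR] = value
    else:
        os.environ.setdefault(ENV_VAR, value)
    return os.environ[ENV_VAR]


# Assumption: call this before the Tune run is launched.
effective = set_global_checkpoint_period(sys.maxsize)
print(f"{ENV_VAR}={effective}")
```

One thing to watch for: because `setdefault()` does not overwrite an existing value, a smaller period exported by the cluster job script would silently win. Printing the effective value, or passing `override=True`, makes that visible.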

Wow, that seems really weird, especially that changing the global checkpointing period seems to fix it. Did you have a chance to look at the experiment checkpoints? Are the trials from BO actually running (i.e. finishing), or are we basically stuck in global checkpointing?

The trials are killed by ASHA; apart from that, they look like they are completing. The loss landscape seems to be extremely flat, so I am worried that I am doing something wrong with the hyperparameter space, and I have posted another question on this topic.

In the meantime, I am testing the HyperOpt searcher to see whether the same issue occurs.