How to debug performance bottlenecks

The slowdown becomes very significant when the Bayesian optimization process moves from the initial random search to the actual BO phase. Once the initial random samples have been computed, there is almost no GPU usage during the BO itself, and I don't know why.
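One way to narrow this down is to time each phase of a BO iteration separately, to see whether the time goes into the objective evaluation (which should use the GPU) or into the optimizer's own bookkeeping (which is often CPU-only). The following is a minimal sketch using only the standard library. The functions `fit_surrogate` and `evaluate_objective` are hypothetical stand-ins, not part of any particular BO library, and would need to be replaced by the real calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named phase.
phase_totals = defaultdict(float)


@contextmanager
def timed(phase):
    """Add the wall-clock time spent inside the `with` block to `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start


def report():
    """Return (phase, seconds) pairs, slowest phase first."""
    return sorted(phase_totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical stand-ins (assumptions for illustration only):
# replace these with the real surrogate update and objective evaluation.
def fit_surrogate():
    time.sleep(0.02)  # pretend CPU-bound model fitting


def evaluate_objective():
    time.sleep(0.005)  # pretend GPU-bound training or evaluation


for _ in range(3):
    with timed("fit_surrogate"):
        fit_surrogate()
    with timed("evaluate_objective"):
        evaluate_objective()

for phase, seconds in report():
    print(f"{phase}: {seconds:.3f}s")
```

If most of the time turns out to be spent outside the objective evaluation, that would be consistent with the GPU sitting idle during the BO phase, although it would not by itself explain the cause.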

Since this seems to be a separate problem from the original topic, I have opened another question.