Extremely slow BO after random sampling ends

LucaCappelletti94 · January 15, 2021, 11:57am

During the initial random sampling phase I almost always have 100% GPU usage while using 40 parallel processes. Once that phase ends and the actual BO starts, I get almost no GPU usage, which seems implausible with so many processes. Ideally, I would always have 100% usage to avoid wasting cluster time.

I am tuning a Keras NN model using the TuneReportCallback object.

How may I proceed?

Thanks!

LucaCappelletti94 · January 15, 2021, 4:40pm

For the time being I am fixing this by using:

import os
import sys

os.environ.setdefault("TUNE_GLOBAL_CHECKPOINT_S", str(sys.maxsize))

I’m not sure it’s what I want, but it seems to do the trick.
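For readers trying the same workaround, here is a minimal sketch of it wrapped in a small helper. The helper name `set_global_checkpoint_period` and its `override` flag are hypothetical and not part of Ray Tune; only the environment variable name comes from the post above. One thing worth knowing is that `os.environ.setdefault` leaves an already exported value untouched, so the sketch makes that choice explicit. It also assumes the variable has to be set before the experiment starts.

```python
import os
import sys

# Name of the Tune setting used in the post above.
CHECKPOINT_PERIOD_VAR = "TUNE_GLOBAL_CHECKPOINT_S"


def set_global_checkpoint_period(seconds: int, override: bool = False) -> str:
    """Hypothetical helper: set the period and return the value now in effect."""
    if override:
        # Replace whatever the shell or job script exported.
        os.environ[CHECKPOINT_PERIOD_VAR] = str(seconds)
    else:
        # setdefault keeps a value that is already present in the environment.
        os.environ.setdefault(CHECKPOINT_PERIOD_VAR, str(seconds))
    return os.environ[CHECKPOINT_PERIOD_VAR]


# Assumption: call this before the experiment is launched.
effective = set_global_checkpoint_period(sys.maxsize)
print(f"{CHECKPOINT_PERIOD_VAR}={effective}")
```

If the printed value is not the one requested, the variable was already set elsewhere (for example in a cluster job script), and passing `override=True` would be needed to change it.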

kai · January 15, 2021, 6:41pm

Wow, that seems really weird, especially that changing the checkpointing seems to fix it. Did you have a chance to look at the experiment checkpoints? Are trials from BO actually running (i.e. finishing), or are we basically stuck in global checkpointing?

LucaCappelletti94 · January 15, 2021, 6:45pm

The trials are killed by ASHA; apart from that, they look like they are completing. The loss landscape seems to be extremely flat, so I am worried that I am doing something wrong with the hyper-parameter space, and I have posed another question on this topic.

In the meantime, I am testing the HyperOpt Searcher to see whether the same issue applies.
