It seems strange that my Ray Tune session has ended up with a bunch of trials paused and not resuming. Could someone shed some light on this problem? Many thanks!
== Status ==
Memory usage on this node: 2.5/59.9 GiB
Using HyperBand: num_stopped=21 total_brackets=5
Round #0:
Bracket(Max Size (n)=8, Milestone (r)=2000, completed=100.0%): {TERMINATED: 8}
Bracket(Max Size (n)=4, Milestone (r)=1334, completed=100.0%): {TERMINATED: 12}
Bracket(Max Size (n)=3, Milestone (r)=1334, completed=100.0%): {TERMINATED: 3}
Bracket(Max Size (n)=2, Milestone (r)=1334, completed=100.0%): {TERMINATED: 15}
Bracket(Max Size (n)=44, Milestone (r)=72, completed=6.2%): {PAUSED: 15}
Resources requested: 0/8 CPUs, 0/1 GPUs, 0.0/35.45 GiB heap, 0.0/17.73 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /home/calvin_chan/data/output/checkpoint/out
Number of trials: 53/125 (15 PAUSED, 38 TERMINATED)
Here are the Ray Tune configuration details:
reporter = tune.JupyterNotebookReporter(overwrite=True, max_progress_rows=35, metric_columns=report_metrics)
scheduler = HyperBandScheduler(metric="ohpl", mode="min", max_t=num_epochs)
searchopt = BasicVariantGenerator(max_concurrent=15)
result = tune.run(
    tune.with_parameters(
        train_network_raytune,
        num_in_feat=N_FEATURE,
        num_epochs=num_epochs,
        train_dataset=dataset_train,
        valid_dataset=dataset_valid,
    ),
    config=config,
    resources_per_trial={"cpu": 1},
    num_samples=num_hp_search_samples,
    local_dir=chkpt_dir,
    progress_reporter=reporter,
    scheduler=scheduler,
    search_alg=searchopt,
)
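
A quick tally of the figures in the status output and configuration above, as a minimal sketch. All numbers are copied from this post, and the variable names are illustrative only, not part of Ray's API. It shows that the bracket counts add up to the reported 53 trials, and that the number of PAUSED trials in the unfinished bracket equals the `max_concurrent=15` setting. Whether that cap is what keeps the bracket from filling is only an assumption worth checking, not something the output confirms.

```python
# Counts copied from the status output above; names are illustrative, not Ray API.
terminated_per_bracket = [8, 12, 3, 15]  # the four brackets at 100% completion
paused_in_open_bracket = 15              # the bracket at 6.2% completion
open_bracket_max_size = 44               # "Max Size (n)=44"
max_concurrent = 15                      # from BasicVariantGenerator(max_concurrent=15)

terminated = sum(terminated_per_bracket)
total_trials = terminated + paused_in_open_bracket

print(f"terminated={terminated}, paused={paused_in_open_bracket}, total={total_trials}")
print(f"open bracket filled: {paused_in_open_bracket}/{open_bracket_max_size}")
print(f"paused == max_concurrent: {paused_in_open_bracket == max_concurrent}")
```

The totals match the "Number of trials: 53/125 (15 PAUSED, 38 TERMINATED)" line, so no trials are unaccounted for. The open bracket simply has 15 of its 44 slots occupied, all of them paused.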