Hello,
I have a question related to tune.run(). As a sanity check, I run it with one epoch, 2 training samples, and 2 validation samples before scaling up to the whole dataset. However, the training seems to run endlessly. Any clue?
Here is my code:
analysis = tune.run(
    train_fn_with_parameters,
    metric="loss_validation",
    mode="min",
    config=config,
    num_samples=1,
    resources_per_trial=resources_per_trial,  # 16 CPUs and 1 GPU
    name="tune_model",
    max_concurrent_trials=1,
    scheduler=tune_scheduler,
)
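For completeness, config and resources_per_trial look roughly like this (the search-space values here are placeholders for illustration; only the kernel and lr names correspond to the real hyperparameters shown in the status table below):

config = {
    "kernel": tune.choice([8, 16, 32]),  # placeholder choices, not my real space
    "lr": tune.loguniform(1e-4, 1.0),    # placeholder bounds, not my real space
}
resources_per_trial = {"cpu": 16, "gpu": 1}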
On my screen I see the following:
== Status ==
Current time: 2022-04-06 08:34:53 (running for 00:02:30.33)
Memory usage on this node: 16.8/58.9 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 1.000: None
Resources requested: 14.0/16 CPUs, 1.0/1 GPUs, 0.0/26.36 GiB heap, 0.0/13.18 GiB objects
Result logdir: /home/tune_model
Number of trials: 1/1 (1 RUNNING)
+--------------------------+----------+-------------------+----------+---------+
| Trial name               | status   | loc               |   kernel |      lr |
|--------------------------+----------+-------------------+----------+---------|
| run_training_f38ce_00000 | RUNNING  | 10.132.0.48:25795 |       16 | 0.32865 |
+--------------------------+----------+-------------------+----------+---------+
The trial has been in RUNNING status for several hours, whereas I expect the training to finish in under a minute.
train_fn_with_parameters = tune.with_parameters(
    build_model,
    fixed_params=params,
    train_paths=train_paths,
    val_paths=val_paths,
    saving_folder=saving_folder,
    tensorboard=tensorboard,
)
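build_model itself is structured roughly like this (a heavily simplified stand-in, not my real training code; the relevant part is that it reports loss_validation back to Tune once per epoch):

from ray import tune

def build_model(config, fixed_params=None, train_paths=None,
                val_paths=None, saving_folder=None, tensorboard=None):
    for epoch in range(fixed_params["nb_epochs"]):
        # ... the real code trains one epoch and evaluates on val_paths ...
        val_loss = 1.0 / (epoch + 1)  # stand-in for the real validation loss
        tune.report(loss_validation=val_loss)  # hands the metric back to Tune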
tune_scheduler = tune.schedulers.ASHAScheduler(
    max_t=params["nb_epochs"],
    grace_period=1,
)
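As far as I understand, ASHA's default time_attr is training_iteration, which increments with every tune.report() call, so with max_t=params["nb_epochs"] and grace_period=1 the scheduler should receive its first result after one epoch; yet the status above shows "Bracket: Iter 1.000: None", as if no result had been reported at all.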
Thanks for your help.