@rliaw there isn't any error, and checkpoints are being saved. My problem (or misunderstanding) is that while trying different hyperparameters, I want Ray to evaluate each model based on its best checkpoint, not on the last epoch.
For example, here I ran a small experiment:
== Status ==
Memory usage on this node: 4.1/47.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/27.39 GiB heap, 0.0/9.42 GiB objects (0/1.0 accelerator_type:GTX)
Current best trial: ab6bc_00002 with val_ci=0.5972576141357422 and parameters={'h_size': 8, 'rec_lmb': 0.98, 'k': 30, 'lr': 0.001, 'l2': 5e-05, 'bins': 10}
Result logdir: ***
Number of trials: 4/4 (4 TERMINATED)
+---------------------+------------+-------+--------+----------+--------+------------------+------------+----------+
| Trial name | status | loc | bins | h_size | iter | total time (s) | val_loss | val_ci |
|---------------------+------------+-------+--------+----------+--------+------------------+------------+----------|
| DEFAULT_ab6bc_00000 | TERMINATED | | 10 | 16 | 1911 | 61.6466 | 0.993486 | 0.594613 |
| DEFAULT_ab6bc_00001 | TERMINATED | | 20 | 16 | 1830 | 61.8445 | 0.982069 | 0.56376 |
| DEFAULT_ab6bc_00002 | TERMINATED | | 10 | 8 | 3030 | 95.5939 | 0.999031 | 0.597258 |
| DEFAULT_ab6bc_00003 | TERMINATED | | 10 | 8 | 2207 | 74.0899 | 1.0091 | 0.556271 |
+---------------------+------------+-------+--------+----------+--------+------------------+------------+----------+
2021-01-02 14:39:53,452 INFO tune.py:444 -- Total run time: 160.61 seconds (158.75 seconds for the tuning loop).
{'h_size': 8, 'rec_lmb': 0.98, 'k': 30, 'lr': 0.001, 'l2': 5e-05, 'bins': 10}
/home/asalimi/projs/miR_Cox-PASNet/fscox/rayres/tune_test3/DEFAULT_ab6bc_00002_2_bins=10,h_size=8_2021-01-02_14-38-17/checkpoint_156/
The numbers in the table show the metrics at the last epoch (the iteration counts differ because of early stopping). As you can see on the last line, analysis.best_checkpoint points to trial 00002. However, when I look at the logs I see that trial 00000 at epoch 20 had the highest val_ci of all trials, so I would expect analysis.best_checkpoint to point to 00000.
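From the ExperimentAnalysis docs I think something like the following is what I'm after: selecting the trial and checkpoint by the peak val_ci rather than the last report. This is only a sketch, and search_space here stands in for my actual config dict; I haven't confirmed it does what I expect:

from ray import tune

# `search_space` is a placeholder for the config dict used above.
analysis = tune.run(
    train_for_tune,
    config=search_space,
    metric="val_ci",
    mode="max",
    num_samples=4,
)

# scope="all" compares trials by their best val_ci over all reported
# iterations instead of the last report (the default scope).
best_trial = analysis.get_best_trial(metric="val_ci", mode="max", scope="all")

# Best checkpoint of that trial, ranked by the same metric.
best_ckpt = analysis.get_best_checkpoint(best_trial, metric="val_ci", mode="max")
print(best_trial.trial_id, best_ckpt)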
Also, I can see that two checkpoints are saved for each trial: one for the best epoch and one for the last epoch (although I set checkpoint_at_end to False). And if I set checkpoint_at_end to True, the following error occurs:
ValueError: 'checkpoint_at_end' cannot be used with a checkpointable function. You can specify and register checkpoints within your trainable function.
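If it helps, my reading of the tune.run docs is that keep_checkpoints_num and checkpoint_score_attr control which checkpoints are retained on disk, so something like this should keep only the single best checkpoint per trial (again a sketch, not verified; search_space is a placeholder as above):

analysis = tune.run(
    train_for_tune,
    config=search_space,
    metric="val_ci",
    mode="max",
    keep_checkpoints_num=1,          # retain at most one checkpoint per trial
    checkpoint_score_attr="val_ci",  # rank kept checkpoints by val_ci (higher is better)
)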
Here is my trainable function:
import os

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback

def train_for_tune(config, checkpoint_dir=None, data_dir=None,
                   num_epochs=EPOCH_NUM, num_gpus=1):
    # DataModule
    dm = ...
    # Model
    if checkpoint_dir:
        # Get the trained model file from the Tune-provided directory
        ckpt_file = os.path.join(checkpoint_dir, "checkpoint")
        # Load the model (just for params)
        model = Model.load_from_checkpoint(
            checkpoint_path=ckpt_file,
            input_dim=dm.n_genes,
            u=dm.u_train,
        )
    else:
        model = Model(...)

    # ----------- Early stopping -----------
    early_stop_callback = EarlyStopping(
        monitor="val_loss",
        min_delta=0.001,
        patience=1000,
        verbose=False,
        mode="min",
    )

    # Report metrics and save a checkpoint on every validation end
    metrics = {"val_loss": "val_loss", "val_ci": "val_ci"}
    tune_reporter = TuneReportCheckpointCallback(metrics, on="validation_end", filename="checkpoint")

    trainer = Trainer(
        max_epochs=num_epochs,
        gpus=num_gpus,
        callbacks=[tune_reporter, early_stop_callback],
        logger=TensorBoardLogger(save_dir=tune.get_trial_dir(), name="", version="."),
        progress_bar_refresh_rate=0,
        log_every_n_steps=1,
        num_sanity_val_steps=0,
    )
    trainer.fit(model, dm)
I also noticed that the checkpoint_dir if-branch never executes.
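For reference, I launch the trainable with something along these lines (a sketch; search_space and the resource numbers stand in for my actual values):

analysis = tune.run(
    tune.with_parameters(train_for_tune, num_epochs=EPOCH_NUM, num_gpus=1),
    resources_per_trial={"cpu": 3, "gpu": 1},
    config=search_space,
    metric="val_ci",
    mode="max",
    num_samples=4,
)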
Thanks in advance