I am currently using Ray AIR 2.0.0rc0 and Hugging Face transformers 4.21.0, following the example in ray/huggingface_text_classification.ipynb at master · ray-project/ray · GitHub. I found that the path where Ray saves the checkpoint differs from the path the HF Trainer logs:
Ray saved: 16602773/ray_test/HuggingFaceTrainer_a062f_00000_0_2022-08-12_12-09-57/checkpoint_000000
HF Trainer logged: 16602773/checkpoint-109
Is there any way to fix this?
Yes. I want the HF Trainer to load the best checkpoint at the end of training, but it fails because of the path mismatch. Is there anything we can do to fix this and make the HF Trainer's checkpoint logic work?
args = TrainingArguments(
    name,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    push_to_hub=False,
    load_best_model_at_end=True,  # i added
    metric_for_best_model="eval_loss",  # i added
    greater_is_better=False,  # i added
    disable_tqdm=True,  # declutter the output a little
    no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
)
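To make clear what I expect from `load_best_model_at_end` together with `metric_for_best_model` and `greater_is_better`, here is a minimal sketch of the selection rule in plain Python. It is not the HF Trainer's actual implementation; `CheckpointRecord`, `pick_best`, the paths, and the metric values are all hypothetical names and made-up numbers used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckpointRecord:
    """Hypothetical record: where a checkpoint lives and its eval metrics."""
    path: str
    metrics: Dict[str, float]


def pick_best(
    records: List[CheckpointRecord],
    metric: str = "eval_loss",
    greater_is_better: bool = False,
) -> CheckpointRecord:
    """Return the record with the best value of `metric`.

    With greater_is_better=False (as for eval_loss) the smallest value wins;
    with greater_is_better=True the largest value wins.
    """
    if not records:
        raise ValueError("no checkpoints to choose from")
    chooser = max if greater_is_better else min
    return chooser(records, key=lambda r: r.metrics[metric])


# Made-up example data: three epoch-end checkpoints with invented losses.
records = [
    CheckpointRecord("checkpoint_000000", {"eval_loss": 0.61}),
    CheckpointRecord("checkpoint_000001", {"eval_loss": 0.48}),
    CheckpointRecord("checkpoint_000002", {"eval_loss": 0.52}),
]
best = pick_best(records, metric="eval_loss", greater_is_better=False)
print(best.path)
```

The point is that whichever directory holds the lowest `eval_loss` is the one that should be reloaded at the end, so the path recorded for it has to be one that actually exists on disk.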
Another thing: checkpoints can currently only be saved at the end of each epoch. Will saving on steps be supported in the next release?
The expected result is that the checkpoint path in the HF Trainer log is the same as the path where Ray saves the checkpoint, so that the HF Trainer can load the best model at the end. Thanks for your help.
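For reference, this is roughly how I check which checkpoint directories Ray actually wrote. It is a sketch that only assumes the `checkpoint_<zero-padded index>` directory naming seen in the path above; `find_ray_checkpoints` is a hypothetical helper, not part of Ray or transformers, and the trial directory is whatever your run produced.

```python
from pathlib import Path
from typing import List


def find_ray_checkpoints(trial_dir: str) -> List[Path]:
    """List checkpoint directories directly under a Ray trial directory.

    Assumes directories are named like `checkpoint_000000`, as in the path
    Ray reported; HF-style `checkpoint-109` names are deliberately not matched.
    """
    root = Path(trial_dir)
    found = [p for p in root.glob("checkpoint_*") if p.is_dir()]
    # Zero-padded indices sort correctly as plain strings.
    return sorted(found, key=lambda p: p.name)
```

Calling it on the trial directory (the `HuggingFaceTrainer_...` folder) returns the directories in save order, which makes it easy to compare against the `checkpoint-<step>` path the HF Trainer logs.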