AIR and HuggingFace Trainer's checkpoint paths are inconsistent

Hi:

Currently, I am using AIR 2.0.0rc0 and huggingface 4.21.0, following the example from ray/huggingface_text_classification.ipynb at master · ray-project/ray · GitHub. I found that the saved checkpoint path is different from what the HF trainer logged (Ray saved 16602773/ray_test/HuggingFaceTrainer_a062f_00000_0_2022-08-12_12-09-57/checkpoint_000000, while the HF trainer logged 16602773/checkpoint-109). Is there any way to fix this?

Hey @richbrain, thanks a bunch for opening this issue. AIR has its own checkpoint format, but it can be converted back to an HF Trainer checkpoint.

Can you tell me more about what you’re trying to do? Would love to help you out.

Yes. I want to use the HF Trainer to load the best checkpoint at the end, but it fails because of the path issue. Is there anything we can do to fix this and make the checkpoint logic in the HF Trainer work?

hmm, thanks for the context! Could you share a simple script to show me your workflow?

We’ll make sure we get it working + get it tested before our next release.

Thanks. As I said, I am following ray/huggingface_text_classification.ipynb at master · ray-project/ray · GitHub. I have changed the args in block 14 to be:

args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=5,
        weight_decay=0.01,
        push_to_hub=False,
        load_best_model_at_end=True,  # I added
        metric_for_best_model="eval_loss",  # I added
        greater_is_better=False,  # I added
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
    )
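
For readers following along, here is a rough sketch of how these args are consumed in the notebook: they configure the transformers.Trainer built inside trainer_init_per_worker, which Ray AIR's HuggingFaceTrainer then runs on each worker. Names like model_checkpoint, tokenizer, use_gpu, and the two Ray Datasets are placeholders for values defined in earlier notebook cells, so treat this as an approximation rather than the exact notebook code.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer
from ray.air.config import ScalingConfig
from ray.train.huggingface import HuggingFaceTrainer

model_checkpoint = "distilbert-base-uncased"  # base model used by the notebook
num_labels = 2                                # depends on the GLUE task chosen
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )
    # `args` is the TrainingArguments instance shown above
    return Trainer(
        model,
        args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )

# ray_train_ds / ray_eval_ds: Ray Datasets built from the tokenized HF dataset
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=use_gpu),
    datasets={"train": ray_train_ds, "evaluation": ray_eval_ds},
)
result = trainer.fit()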

Another thing: the checkpoint can only be saved at the end of an epoch. Will saving every N steps be supported in the next release?
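
For reference, step-based checkpointing in vanilla transformers is configured with save_strategy="steps" and save_steps, as in the sketch below; whether AIR's checkpoint syncing picks these up mid-epoch in 2.0.0rc0 is exactly the open question here.

from transformers import TrainingArguments

# plain-transformers illustration of step-based saving (independent of Ray);
# "my-run" is a hypothetical output directory name
step_args = TrainingArguments(
    "my-run",
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,  # must be a round multiple of eval_steps when loading the best model at the end
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)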

The expected result is that the checkpoint path in the HF trainer log is the same as where Ray saves the checkpoint, so that the HF trainer can load the best model at the end. Thanks for your help.

Hmm, seems like perhaps you can take the generated checkpoint and do:

from ray.train.huggingface.huggingface_trainer import CHECKPOINT_PATH_ON_NODE_KEY

# result is the Result object returned by HuggingFaceTrainer.fit()
air_checkpoint = result.checkpoint
# extract the on-node directory where the HF checkpoint files were written
hf_checkpoint_path = air_checkpoint.to_dict()[CHECKPOINT_PATH_ON_NODE_KEY]

# your_hf_trainer = transformers.Trainer(...)
your_hf_trainer.train(resume_from_checkpoint=hf_checkpoint_path)

Let me know if that works for now?
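
If you only need the weights rather than the full Trainer state, that extracted path can also be loaded directly with the standard transformers from_pretrained APIs; a minimal sketch, assuming the directory is a regular HF Trainer checkpoint with the usual config and weight files:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# hf_checkpoint_path comes from the snippet above
model = AutoModelForSequenceClassification.from_pretrained(hf_checkpoint_path)
# only works if a tokenizer was passed to the Trainer and saved alongside the model
tokenizer = AutoTokenizer.from_pretrained(hf_checkpoint_path)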

from ray.train.huggingface import HuggingFaceCheckpoint
from transformers import AutoModelForSequenceClassification

# wrap the AIR checkpoint and reconstruct the HF model from it
checkpoint = HuggingFaceCheckpoint.from_checkpoint(result.checkpoint)
hf_model = checkpoint.get_model(model=AutoModelForSequenceClassification)

This works.
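
As a quick follow-up, the recovered hf_model can be exercised for inference right away; a sketch, where the tokenizer name ("distilbert-base-uncased") is an assumption based on the notebook's base checkpoint:

from transformers import AutoTokenizer, pipeline

# the tokenizer name below is assumed from the notebook; use your own base checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
classifier = pipeline("text-classification", model=hf_model, tokenizer=tokenizer)
print(classifier("This sentence reads just fine."))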

I haven’t tried this way yet. I will update later when I get results.