AIR and HuggingFace Trainer's checkpoint paths are inconsistent

Hi:

Currently, I am using AIR 2.0.0rc0 and huggingface 4.21.0, following the example from ray/huggingface_text_classification.ipynb at master · ray-project/ray · GitHub. I found that the saved checkpoint path is different from what the HF trainer logged (Ray saved 16602773/ray_test/HuggingFaceTrainer_a062f_00000_0_2022-08-12_12-09-57/checkpoint_000000, while the HF trainer logged 16602773/checkpoint-109). Is there any way to fix this?

Hey @richbrain, thanks a bunch for opening this issue. AIR has its own checkpoint format, but it can be converted back to an HF Trainer checkpoint.

Can you tell me more about what you’re trying to do? Would love to help you out.

Yes. I want to use the HF Trainer to load the best checkpoint at the end, but it fails because of the path issue. Is there anything we can do to fix this and make the checkpoint logic in the HF Trainer work?

hmm, thanks for the context! Could you share a simple script to show me your workflow?

We’ll make sure we get it working + get it tested before our next release.

Thanks. As I said, I am following ray/huggingface_text_classification.ipynb at master · ray-project/ray · GitHub. I have changed the args in block 14 to be:

args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=5,
        weight_decay=0.01,
        push_to_hub=False,
        load_best_model_at_end=True,  # I added
        metric_for_best_model="eval_loss",  # I added
        greater_is_better=False,  # I added
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
    )
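
For readers following along, here is a rough sketch of how these args are consumed in the notebook: they configure the transformers.Trainer built inside trainer_init_per_worker, which Ray AIR's HuggingFaceTrainer then runs on each worker. Names like model_checkpoint, tokenizer, use_gpu, and the two Ray Datasets are placeholders for values defined in earlier notebook cells, so treat this as an approximation rather than the exact notebook code.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer
from ray.air.config import ScalingConfig
from ray.train.huggingface import HuggingFaceTrainer

model_checkpoint = "distilbert-base-uncased"  # base model used by the notebook
num_labels = 2                                # depends on the GLUE task chosen
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )
    # `args` is the TrainingArguments instance shown above
    return Trainer(
        model,
        args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )

# ray_train_ds / ray_eval_ds: Ray Datasets built from the tokenized HF dataset
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=use_gpu),
    datasets={"train": ray_train_ds, "evaluation": ray_eval_ds},
)
result = trainer.fit()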

Another thing: the checkpoint can only be saved at the end of an epoch. Will saving every N steps be supported in the next release?
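
For reference, step-based checkpointing in vanilla transformers is configured with save_strategy="steps" and save_steps, as in the sketch below; whether AIR's checkpoint syncing picks these up mid-epoch in 2.0.0rc0 is exactly the open question here.

from transformers import TrainingArguments

# plain-transformers illustration of step-based saving (independent of Ray);
# "my-run" is a hypothetical output directory name
step_args = TrainingArguments(
    "my-run",
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,  # must be a round multiple of eval_steps when loading the best model at the end
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)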

The expected result is that the checkpoint path in the HF trainer log is the same as where Ray saves the checkpoint, so that the HF trainer can load the best model at the end. Thanks for your help.

Hmm, seems like perhaps you can take the generated checkpoint and do:

from ray.train.huggingface.huggingface_trainer import CHECKPOINT_PATH_ON_NODE_KEY

# result is the Result object returned by HuggingFaceTrainer.fit()
air_checkpoint = result.checkpoint
# extract the on-node directory where the HF checkpoint files were written
hf_checkpoint_path = air_checkpoint.to_dict()[CHECKPOINT_PATH_ON_NODE_KEY]

# your_hf_trainer = transformers.Trainer(...)
your_hf_trainer.train(resume_from_checkpoint=hf_checkpoint_path)

Let me know if that works for now?
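
If you only need the weights rather than the full Trainer state, that extracted path can also be loaded directly with the standard transformers from_pretrained APIs; a minimal sketch, assuming the directory is a regular HF Trainer checkpoint with the usual config and weight files:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# hf_checkpoint_path comes from the snippet above
model = AutoModelForSequenceClassification.from_pretrained(hf_checkpoint_path)
# only works if a tokenizer was passed to the Trainer and saved alongside the model
tokenizer = AutoTokenizer.from_pretrained(hf_checkpoint_path)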

from ray.train.huggingface import HuggingFaceCheckpoint
from transformers import AutoModelForSequenceClassification

# wrap the AIR checkpoint and reconstruct the HF model from it
checkpoint = HuggingFaceCheckpoint.from_checkpoint(result.checkpoint)
hf_model = checkpoint.get_model(model=AutoModelForSequenceClassification)

This works.
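
As a quick follow-up, the recovered hf_model can be exercised for inference right away; a sketch, where the tokenizer name ("distilbert-base-uncased") is an assumption based on the notebook's base checkpoint:

from transformers import AutoTokenizer, pipeline

# the tokenizer name below is assumed from the notebook; use your own base checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
classifier = pipeline("text-classification", model=hf_model, tokenizer=tokenizer)
print(classifier("This sentence reads just fine."))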

I haven’t tried this way yet. I will update later when I get results.