TL;DR: `TransformersPredictor.from_checkpoint` calls the `Pipeline` constructor with `checkpoint_path` instead of an actual model instance.

I created a custom HF Transformers `Pipeline`, and it works correctly on its own. I then tried creating a `TransformersPredictor` on top of it by passing a `TransformersCheckpoint` and the custom `Pipeline` class:
from ray.train.huggingface import TransformersCheckpoint, TransformersPredictor
# Loading a trained HF transformer model
checkpoint = TransformersCheckpoint.from_directory("model_base_100_pages_10_epochs_3_classes_best/")
predictor = TransformersPredictor.from_checkpoint(checkpoint, pipeline_cls=PageTypeClassificationPipeline)
But this gives me an error. It is clear from the traceback that `from_checkpoint` passes `checkpoint_path` as the `model` argument to the `Pipeline` class constructor, which expects an actual model instance (check the HF source code here), not a path. The HF `Pipeline` base class is different from the HF `pipeline` factory, the wrapper that can take model names as strings to construct actual pipelines.
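To make the distinction concrete, a minimal sketch (the model name here is only a placeholder, not the model from my project):

```python
from transformers import AutoModelForSequenceClassification, pipeline

# The lowercase `pipeline` factory accepts a string and resolves it to a
# model instance internally before building the pipeline object:
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# The `Pipeline` base class does no such resolution: its __init__ reads
# `model.config` immediately, so it needs an already-loaded model object.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
```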
──────────────────────────────── Traceback (most recent call last) ────────────────────────────────
in <module>:1

❱ 1 TransformersPredictor.from_checkpoint(checkpoint, pipeline_cls=PageTypeClassificationPip
  2

/usr/local/lib/python3.8/site-packages/ray/train/huggingface/transformers/transformers_predictor.py:148 in from_checkpoint

  145 │   │   with checkpoint.as_directory() as checkpoint_path:
  146 │   │   │   # Tokenizer will be loaded automatically (no need to specify
  147 │   │   │   # `tokenizer=checkpoint_path`)
❱ 148 │   │   │   pipeline = pipeline_cls(model=checkpoint_path, **pipeline_kwargs)
  149 │   │   return cls(
  150 │   │   │   pipeline=pipeline,
  151 │   │   │   preprocessor=preprocessor,

in __init__:6

  3
  4 class PageTypeClassificationPipeline(Pipeline):
  5 │   def __init__(self, *args, **kwargs):
❱ 6 │   │   super().__init__(*args, **kwargs)
  7 │   │   self.processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
  8
  9 │   def _sanitize_parameters(self, top_k=None):

/usr/local/lib/python3.8/site-packages/transformers/pipelines/base.py:756 in __init__

  753 │   │   **kwargs,
  754 │   ):
  755 │   │   if framework is None:
❱ 756 │   │   │   framework, model = infer_framework_load_model(model, config=model.config)
  757
  758 │   │   self.task = task
  759 │   │   self.model = model
────────────────────────────────────────────────────────────────────────────────────────────────────
AttributeError: 'str' object has no attribute 'config'
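As an aside, one way to sidestep this without touching Ray at all would be to make the pipeline itself tolerant of a path. A sketch, where the `isinstance` check on `model` is the only change to my pipeline's `__init__` (assuming the checkpoint directory loads with `MarkupLMForSequenceClassification.from_pretrained`):

```python
from transformers import MarkupLMForSequenceClassification, MarkupLMProcessor, Pipeline


class PageTypeClassificationPipeline(Pipeline):
    def __init__(self, model=None, **kwargs):
        # If from_checkpoint hands us a checkpoint path instead of a model,
        # load the model ourselves before calling the base constructor.
        if isinstance(model, str):
            model = MarkupLMForSequenceClassification.from_pretrained(model)
        super().__init__(model=model, **kwargs)
        self.processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

    # _sanitize_parameters / preprocess / _forward / postprocess unchanged
```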
To prove the point, I tried creating the `TransformersPredictor` through its constructor instead of `from_checkpoint`, and it worked: I could run prediction. However, to use `BatchPredictor`, I'm forced through `TransformersPredictor.from_checkpoint` under the hood, and I can't get around it.
# This code works
from ray.train.huggingface import TransformersCheckpoint, TransformersPredictor
from transformers import MarkupLMForSequenceClassification

checkpoint = TransformersCheckpoint.from_directory("model_base_100_pages_10_epochs_3_classes_best/")
model = checkpoint.get_model(MarkupLMForSequenceClassification)
predictor = TransformersPredictor(pipeline=PageTypeClassificationPipeline(model=model))
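Prediction then works on a plain batch, e.g. (the column name and HTML content are placeholders):

```python
import pandas as pd

# A hypothetical single-column batch of raw HTML strings
batch = pd.DataFrame({"html": ["<html><body><h1>Pricing</h1></body></html>"]})
predictions = predictor.predict(batch)
```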
I also created a custom `MyTransformersPredictor` by overriding `TransformersPredictor`'s `from_checkpoint` as follows, and it worked.
Original:

with checkpoint.as_directory() as checkpoint_path:
    # Tokenizer will be loaded automatically (no need to specify
    # `tokenizer=checkpoint_path`)
    pipeline = pipeline_cls(model=checkpoint_path, **pipeline_kwargs)
return cls(
    pipeline=pipeline,
    preprocessor=preprocessor,
    use_gpu=use_gpu,
)
Overridden:

# with checkpoint.as_directory() as checkpoint_path:
#     # Tokenizer will be loaded automatically (no need to specify
#     # `tokenizer=checkpoint_path`)
model = checkpoint.get_model(MarkupLMForSequenceClassification)
pipeline = pipeline_cls(model=model, **pipeline_kwargs)
return cls(
    pipeline=pipeline,
    preprocessor=preprocessor,
    use_gpu=use_gpu,
)
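Put together, the full override looks roughly like this. This is only a sketch against the `from_checkpoint` signature visible in the traceback above, so it should be double-checked against the installed Ray version:

```python
from typing import Optional, Type

from ray.air import Checkpoint
from ray.train.huggingface import TransformersCheckpoint, TransformersPredictor
from transformers import MarkupLMForSequenceClassification
from transformers.pipelines import Pipeline


class MyTransformersPredictor(TransformersPredictor):
    @classmethod
    def from_checkpoint(
        cls,
        checkpoint: Checkpoint,
        *,
        pipeline_cls: Optional[Type[Pipeline]] = None,
        use_gpu: bool = False,
        **pipeline_kwargs,
    ) -> "TransformersPredictor":
        checkpoint = TransformersCheckpoint.from_checkpoint(checkpoint)
        preprocessor = checkpoint.get_preprocessor()
        # Load a concrete model object instead of handing the pipeline a path
        model = checkpoint.get_model(MarkupLMForSequenceClassification)
        pipeline = pipeline_cls(model=model, **pipeline_kwargs)
        return cls(
            pipeline=pipeline,
            preprocessor=preprocessor,
            use_gpu=use_gpu,
        )
```

With that, `BatchPredictor` can be pointed at the subclass (`ds` stands for whatever Ray Dataset holds the pages):

```python
from ray.train.batch_predictor import BatchPredictor

batch_predictor = BatchPredictor.from_checkpoint(
    checkpoint,
    MyTransformersPredictor,
    pipeline_cls=PageTypeClassificationPipeline,
)
predictions = batch_predictor.predict(ds)
```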
Pipeline code for reference:

from transformers import MarkupLMProcessor, Pipeline


class PageTypeClassificationPipeline(Pipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

    def _sanitize_parameters(self, top_k=None):
        # Route `top_k` to postprocess; no preprocess/forward parameters
        postprocess_params = {}
        if top_k is not None:
            postprocess_params["top_k"] = top_k
        return {}, {}, postprocess_params

    def preprocess(self, inputs):
        # Tokenize raw HTML with the MarkupLM processor
        return self.processor(
            inputs, padding="max_length", max_length=512, truncation=True, return_tensors="pt"
        )

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, top_k=1):
        if top_k > self.model.config.num_labels:
            top_k = self.model.config.num_labels
        # Unnest batch
        probs = model_outputs.logits[0].softmax(-1)
        scores, ids = probs.topk(top_k)
        scores = scores.tolist()
        ids = ids.tolist()
        return [
            {"score": score, "label": self.model.config.id2label[_id]}
            for score, _id in zip(scores, ids)
        ]