Error in HuggingFaceTrainer (Transformer) v2.4.0

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I followed the Hugging Face Transformers trainer tutorial.

I had working code that ran natively with the HF Transformers Trainer library, and I was experimenting with running it on Ray.

I created a Torch dataset object and then created Ray datasets from it:

ray_train_ds = ray.data.from_torch(train_dataset)
ray_evaluation_ds = ray.data.from_torch(test_dataset)

I then wrapped the HF Trainer in a trainer_init_per_worker function:

scaling_config = ScalingConfig(num_workers=1, use_gpu=use_gpu)
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=scaling_config,
    datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
)

But when I call trainer.fit(), it fails with:

 File "/usr/local/lib/python3.8/site-packages/ray/train/huggingface/_huggingface_utils.py", line 75, in __iter__
    yield (0, {k: v for k, v in row.as_pydict().items()})
AttributeError: 'dict' object has no attribute 'as_pydict'

I did some debugging and noticed that on the master branch this single line of code has been changed from
yield (0, {k: v for k, v in row.as_pydict().items()}) to yield (0, {k: v for k, v in row.items()})
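
The difference between the two lines can be reproduced without Ray. The following is only a minimal sketch: LegacyRow is a hypothetical stand-in for a row type that exposes as_pydict(), and convert_2_4_0 / convert_master are illustrative names that mirror the two expressions quoted above, not Ray's actual functions.

```python
class LegacyRow:
    """Hypothetical stand-in for a row object that exposes as_pydict()."""

    def __init__(self, data):
        self._data = dict(data)

    def as_pydict(self):
        return dict(self._data)


def convert_2_4_0(row):
    # Mirrors the expression from the 2.4.0 traceback.
    return {k: v for k, v in row.as_pydict().items()}


def convert_master(row):
    # Mirrors the expression found on the master branch.
    return {k: v for k, v in row.items()}


# A row that is already a plain dict, as in the report.
plain_row = {"input_ids": [101, 102], "labels": 1}

try:
    convert_2_4_0(plain_row)
    error_message = None
except AttributeError as exc:
    # A plain dict has no as_pydict() method, so the 2.4.0 expression fails.
    error_message = str(exc)

print(error_message)
print(convert_master(plain_row))
```

The 2.4.0-style expression only works on rows that provide as_pydict(), while the master-style expression works on anything with an items() method, including a plain dict.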

I think it might be a bug in 2.4.0, but I am curious: how did people manage to train with it?

@MichaelAzmy Can you share which tutorial example you are following? The trainer_init_per_worker function has to return the HF Trainer with all relevant HF TrainingArguments.
Here is an example that shows it.

cc: @Yard1

@Jules_Damji That’s the tutorial I was following, but I changed the dataset and model to fit the task I am working on.

I just created a Torch Dataset object whose __getitem__ returns a dict as usual, in the same format that the model expects.

The TrainingArguments and HF Trainer are as expected. However, the issue seems to be in the Ray dataset iterator, as you can see above. When I traced it back, I found that the master branch had been refactored and that the line that errors for me was fixed on master but not in version 2.4.0.

I assume it fails while iterating over the converted Ray → HF dataset, where Ray tries to cast each RayDatasetHFIterable row to a pydict although the row is already a dict.

I believe the conversion from Ray to Torch should be agnostic to the data itself, so the issue comes from the wrapper, even before anything is passed to the trainer.

Here is my HF Trainer code for reference.

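import datetime

from ray.air.config import ScalingConfig
from ray.train.huggingface import HuggingFaceTrainer
from transformers import MarkupLMForSequenceClassification, Trainer, TrainingArguments

# Note: compute_metrics is defined elsewhere (not shown in this post).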

use_gpu = False

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    id2label = {0: "pdp", 1: "collection", 2: "other"}
    label2id = {label: idx for idx, label in id2label.items()}
    num_labels = len(id2label)

    model = MarkupLMForSequenceClassification.from_pretrained("microsoft/markuplm-base", id2label=id2label, label2id=label2id, num_labels=num_labels)

    args = TrainingArguments(
        output_dir=f"page-type-classifier-v1-{str(datetime.datetime.now())}",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=10,
        weight_decay=0.01,
        logging_strategy="epoch",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=False,
        no_cuda=(not use_gpu)
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

scaling_config = ScalingConfig(num_workers=1, use_gpu=use_gpu)
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=scaling_config,
    datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
)
result = trainer.fit()

Also, to add more context: the Ray dataset that I created from the Torch dataset does not give errors when I iterate through it manually. You can see that every row is a dict here.

And that’s what the model is expecting. Why does Ray try to enforce a conversion with as_pydict() although the row is already a dict?

yield (0, {k: v for k, v in row.as_pydict().items()}) 
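
The manual check described above can be sketched as a small helper that counts the Python types of the first few rows of any iterable. This is only an illustration: summarize_row_types and sample_rows are hypothetical names, and the sample data stands in for the real dataset.

```python
from collections import Counter


def summarize_row_types(rows, limit=100):
    """Count the type names of the first `limit` rows of any iterable."""
    counts = Counter()
    for i, row in enumerate(rows):
        if i >= limit:
            break
        counts[type(row).__name__] += 1
    return dict(counts)


# Stand-in rows shaped like the dicts the model expects.
sample_rows = [
    {"input_ids": [1, 2], "labels": 0},
    {"input_ids": [3], "labels": 2},
]
print(summarize_row_types(sample_rows))
```

With Ray installed, the same helper could presumably be pointed at ray_train_ds.iter_rows(), assuming that method is available in the installed version. If every row is reported as a dict, the data matches what the model expects and the failure comes from the as_pydict() call.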

Hey @Jules_Damji, could you help me understand where the issue is?

Hi @MichaelAzmy, thanks for reporting the issue. If you use the fix on master, which removes the as_pydict() conversion, does this unblock your use case? It looks like we made this change as part of this PR to enable Ray Data strict mode by default for Ray 2.5.

Yes, I found a way to use the nightly version, and I can verify that the bug is fixed there. I found other bugs but will report them in separate posts. Thanks.