Can't reproduce HF Transformers accuracy using Ray vs native GPU

I am using HF Transformers to fine-tune a model on a custom dataset.
I already had it working natively with HF on a GPU, but I cannot reproduce the same results on Ray.
I pickled the train and eval datasets to make sure everything is exactly the same in the native and Ray cases.
Without Ray, it works fine and accuracy changes across epochs. On Ray, however, the model produces exactly the same accuracy (to the 10th decimal place) on all 10 epochs. I verified this multiple times.
I am using the same hyperparameters for both trials. I am also training on one node with one GPU on Ray, so distributed training should not be the issue.
I load the pickled Torch datasets as follows and then convert them to Ray Datasets.
Native HF uses the Torch datasets directly, while Ray HF uses the Ray Datasets.

I noticed that the Tensors in the Torch datasets are converted to NumPy arrays when they are converted to Ray Datasets, but Ray converts them back to Tensors before calling the model, and it does not raise any errors.

Any thoughts about what could be the issue?

import pickle
import ray

# Restore the pickled Torch datasets used for the native HF run
train_dataset = pickle.loads(TRAIN_DATA)
dev_dataset = pickle.loads(EVAL_DATA)

# Convert them to Ray Datasets for the Ray run
ray_train_ds = ray.data.from_torch(train_dataset)
ray_dev_ds = ray.data.from_torch(dev_dataset)
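
A quick way to inspect what the conversion produced (a small sketch using Ray Data's schema() and take(); the comments describe what I expect to see, not verified output):

print(ray_train_ds.schema())  # column layout Ray created from the Torch dataset
print(ray_train_ds.take(1))   # first materialized row, with Tensors stored as NumPy arrays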

Native HF working code:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="models/model_base_100_pages_10_epochs_3_classes",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    # tokenizer=tokenizer,
    # data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Ray HF Transformers version (the code runs without errors, but accuracy does not change):

from transformers import MarkupLMForSequenceClassification, Trainer, TrainingArguments
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig  # import paths may differ across Ray versions
from ray.train.huggingface import TransformersTrainer

use_gpu = True

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    id2label = {0: "pdp", 1 :"collection", 2: "other"}
    label2id = {label:id for id, label in id2label.items()}
    num_labels = len(id2label)

    model = MarkupLMForSequenceClassification.from_pretrained("microsoft/markuplm-base", id2label=id2label, label2id=label2id, num_labels=num_labels)

    args = TrainingArguments(
        output_dir="page-type-classifier-v1-test",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=10,
        weight_decay=0.01,
        logging_strategy="epoch",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=False,
        no_cuda=(not use_gpu)
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

scaling_config = ScalingConfig(num_workers=1, use_gpu=use_gpu)
trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=scaling_config,
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
    datasets={"train": ray_train_ds, "evaluation": ray_dev_ds},
)
result = trainer.fit()

Your code generally looks correct.

My guess at the moment is that the dataset is not transformed correctly within Ray.
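
One quick way to check that is to log what the function actually receives before building the HF Trainer (just a sketch, added at the top of your existing function):

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    # Print the concrete types Ray hands to the worker
    print(type(train_dataset), type(eval_dataset))
    ...  # rest of the function unchanged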

Can you try, in your trainer_init_per_worker, to load the datasets again like this instead of using the train_dataset and eval_dataset that are passed to the function:

train_dataset = pickle.loads(TRAIN_DATA)
eval_dataset = pickle.loads(EVAL_DATA)

and see if this changes things?
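
In other words, override the passed-in arguments at the top of the function, roughly like this (a sketch; the rest of your trainer_init_per_worker stays exactly as in your post):

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    # Ignore the Ray-converted datasets and rebuild the original Torch datasets
    train_dataset = pickle.loads(TRAIN_DATA)
    eval_dataset = pickle.loads(EVAL_DATA)
    ...  # model / TrainingArguments / Trainer construction unchanged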

If there’s a subset of the data you can share for us to reproduce the issue, that would be immensely helpful!