Running tune with HF Transformers On Ray Project Image

Hello all,

Huge thanks to the team & community for maintaining this great library. I'm hoping you might help me find what I'm missing in my implementation below.

I am presently attempting to run tune.run() with a trainable function that includes a Trainer instance from HuggingFace’s transformers library.

The code runs on my local machine, but now I am trying to use the rayproject/ray image in order to run distributed training on a Kubernetes cluster.

I used the following Dockerfile to build my own image on top of the Ray image:

FROM rayproject/ray

# install core compilers and networking utilities
RUN apt-get update && apt-get install -y gcc g++ inetutils-ping

# copy environment files
ENV PATH ${PATH}:/root/.local/bin
ENV PYTHONPATH ${PYTHONPATH}:/root/.local/bin
ADD requirements.txt /tmp/requirements.txt
RUN pip install --user -r /tmp/requirements.txt

# copy python script
RUN mkdir -p transformers_train
WORKDIR transformers_train
COPY ./ray_transformers.py ./ray_transformers.py
CMD python ray_transformers.py 
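
As a quick environment check, I can run something like the script below inside the container to confirm that the pip --user installs under /root/.local actually resolve at runtime. This is just a minimal sketch; the package names are an assumption based on my requirements.txt.

import importlib
import site
import sys

# hypothetical sanity-check script, not part of ray_transformers.py
print("sys.path:", sys.path)
print("user site-packages:", site.getusersitepackages())

# assumed package list; replace with whatever requirements.txt actually installs
for name in ("ray", "transformers", "torch"):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "?"), "->", module.__file__)
    except ImportError as exc:
        print("FAILED to import", name, "-", exc)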

However, upon running the container, I encountered the following error:

The call to tune.run() is as follows:

# imports this snippet relies on (presumably defined at the top of ray_transformers.py)
import os
from sys import stdout

from ray import tune
from ray.tune import CLIReporter
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, default_compute_objective


def tune_trainer(trainer, scheduler, num_samples, config, progress_reporter=None,
                 use_checkpoints=True):
    """Tune a Trainer object."""
    def tune_objective(trial, checkpoint_dir=None):
        """trainable for ray tune"""
        model_path = None
        stdout.write('DEBUG | Trainer checkpoint usage set to: ' + str(trainer.use_tune_checkpoints) + '\n')
        if checkpoint_dir:
            stdout.write('DEBUG | Tune checkpoint dir located at ... \n')
            stdout.write(checkpoint_dir)
            for subdir in os.listdir(checkpoint_dir):
                if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                    model_path = os.path.join(checkpoint_dir, subdir)
        trainer.objective = None
        trainer.train(model_path=model_path, trial=trial)

        setattr(trainer, 'compute_objective', default_compute_objective)
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
            trainer.objective = trainer.compute_objective(metrics)
            trainer._tune_save_checkpoint()
            tune.report(objective=trainer.objective, **metrics, done=True)

    if not progress_reporter:
        progress_reporter = CLIReporter(metric_columns=["objective"])
    if use_checkpoints:
        trainer.use_tune_checkpoints = True

    # sync_config = tune.SyncConfig(
    #     sync_to_driver=NamespacedKubernetesSyncer("ray")
    # )
    tune.register_trainable("tune_objective", tune_objective)
    analysis = tune.run("tune_objective", scheduler=scheduler, num_samples=num_samples, config=config,
                        progress_reporter=progress_reporter, metric='objective', mode='min')
    return analysis
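
For completeness, I invoke it roughly like this. The scheduler settings and search space below are illustrative placeholders rather than my exact values, and trainer is the HF Trainer instance built earlier in ray_transformers.py.

from ray import tune
from ray.tune.schedulers import ASHAScheduler

# placeholder search space over TrainingArguments fields; not the exact values used
config = {
    "learning_rate": tune.loguniform(1e-5, 5e-5),
    "per_device_train_batch_size": tune.choice([8, 16]),
}
# metric/mode are supplied by tune.run() inside tune_trainer, so they are omitted here
scheduler = ASHAScheduler()

analysis = tune_trainer(trainer, scheduler=scheduler, num_samples=4, config=config)
print(analysis.get_best_config(metric="objective", mode="min"))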

What I’ve Tried

  1. My first instinct upon following the traceback was to notice that tune.run(), on seeing that my trainable is a function rather than an Experiment, wraps it in an Experiment instance. The comments on the run_or_experiment argument say that if the trainable is not an Experiment, it should be registered with tune.register_trainable("lambda_id", lambda x: ...), so I tried that (see the sketch after this list) but received the same error.

  2. Alternating between transformers 3.4 and 3.5.

  3. Adding /root/.local/bin to the PYTHONPATH variable, since pip informed me (while installing requirements.txt) that some packages were being installed there. I suspected some of those packages might not be imported correctly because that directory wasn't on PYTHONPATH, but the error persisted.
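
For reference, the two invocation forms I tried look like this. This is a minimal sketch with a dummy objective, not my actual trainable.

from ray import tune

def dummy_objective(config):
    # report a single metric so Tune has something to optimize
    tune.report(objective=config["x"] ** 2)

# form 1: pass the function directly; tune.run() wraps it in an Experiment internally
tune.run(dummy_objective, config={"x": tune.uniform(-1, 1)}, metric="objective", mode="min")

# form 2: register the function under a name and run it by that name
tune.register_trainable("dummy_objective", dummy_objective)
tune.run("dummy_objective", config={"x": tune.uniform(-1, 1)}, metric="objective", mode="min")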

I will obviously keep investigating the issue, but I’m befuddled because the script works fine on my local machine.

This leads me to suspect the cause may be some really obvious environment configuration that I'm missing. If anyone has a clue as to what I'm overlooking, I would be hugely grateful.

Much thanks!

cc @amogkam Can you answer this question?

Hey @Michael_Ma, what version of Ray is this? Can you try upgrading to the latest nightly version of Ray and see if you still get the issue?

Also, just wondering if you have considered using the built-in hyperparameter_search functionality in the HF Trainer. You can see an example here: https://huggingface.co/blog/ray-tune. It might make things a lot simpler for you.
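
Roughly, usage looks like this. This is a minimal sketch: the search space and trial count are placeholders, and it assumes the Trainer was constructed with a model_init callable rather than a fixed model.

from ray import tune

# placeholder search space over TrainingArguments fields
best_run = trainer.hyperparameter_search(
    hp_space=lambda _: {
        "learning_rate": tune.loguniform(1e-5, 5e-5),
        "num_train_epochs": tune.choice([2, 3, 4]),
    },
    backend="ray",
    n_trials=4,
    direction="minimize",
)
print(best_run.hyperparameters)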

@sangcho and @amogkam, much thanks for the responses!

I have tried running the above with both version 1.0.1.post1 and the latest nightly wheels for Linux. I have also tried not installing any version of Ray via the Dockerfile, since I assumed there would already be an installation in the rayproject/ray image. Interestingly, when I run on the latest nightly wheel, another error (possibly informative?) shows up above the original one.

I have also tried the hyperparameter_search function from HF, which again works on my local machine, but when run on an extension of rayproject/ray I receive the following, even more enigmatic error:

The picture above leads me to believe it really might be configuration related?