Ray Train example with transformers

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task.

Low

I tested the Ray Train transformers example.
(transformers_example — Ray v1.9.0)

I ran training with the following command, following the instructions here.

#!/bin/bash

export TASK_NAME=mrpc

python example.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --max_length 128 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --address 'ray://[MY_RAY_CLIENT_URL]' \
  --num_workers 8 \
  --use_gpu
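
For context, the --address, --num_workers, and --use_gpu flags are handled by example.py itself: it connects to the cluster through Ray Client and launches a ray.train.Trainer. A minimal sketch of that setup (assuming the Ray 1.9 Trainer API, with train_func standing in for the example's training loop):

import ray
from ray.train import Trainer

def train_func(config):
    # Placeholder for the example's per-worker training loop.
    ...

# --address: connect to the remote cluster through Ray Client.
ray.init(address="ray://[MY_RAY_CLIENT_URL]")

# --num_workers / --use_gpu: start 8 training workers, each assigned one GPU.
trainer = Trainer(backend="torch", num_workers=8, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()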

It runs fine, but it doesn’t use all of the GPUs; it only uses one.
How can I fix this?

Hey @77loopin, thanks for posting this! I took a look into this and it seems like there’s indeed a bug in the script. The accelerator is reading from an environment variable that isn’t being set!

Adding the following line to the start of the training function fixes it:

os.environ["LOCAL_RANK"] = str(ray.train.local_rank())

I made a fix for the example here.
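
To spell out what’s going on: Ray Train launches one process per worker, but in this version it doesn’t export LOCAL_RANK, and HuggingFace Accelerate reads LOCAL_RANK to decide which GPU each process should use, so every worker ends up on the same device. Roughly, the fixed training function looks like this (a sketch assuming an Accelerate-based train_func like the example’s):

import os

import ray.train
from accelerate import Accelerator

def train_func(config):
    # Ray Train doesn't set LOCAL_RANK for its worker processes here, so
    # expose each worker's local rank before constructing the Accelerator.
    os.environ["LOCAL_RANK"] = str(ray.train.local_rank())

    # With LOCAL_RANK set, each worker's Accelerator binds to its own GPU.
    accelerator = Accelerator()
    print(f"worker {ray.train.world_rank()} -> device {accelerator.device}")
    # ... rest of the example's training loop (model, optimizer, epochs) ...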


Thanks @matthewdeng
It was resolved!!