Ray Train example with transformers

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task.

Low

I tested the Ray Train transformers example.
(transformers_example — Ray v1.9.0)

I ran training with the following command, following the instructions here.

#!/bin/bash

export TASK_NAME=mrpc

python example.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --max_length 128 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --address 'ray://[MY_RAY_CLIENT_URL]' \
  --num_workers 8 \
  --use_gpu
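
For context, the --address, --num_workers, and --use_gpu flags are handled by example.py itself: it connects to the cluster through Ray Client and launches a ray.train.Trainer. A minimal sketch of that setup (assuming the Ray 1.9 Trainer API, with train_func standing in for the example's training loop):

import ray
from ray.train import Trainer

def train_func(config):
    # Placeholder for the example's per-worker training loop.
    ...

# --address: connect to the remote cluster through Ray Client.
ray.init(address="ray://[MY_RAY_CLIENT_URL]")

# --num_workers / --use_gpu: start 8 training workers, each assigned one GPU.
trainer = Trainer(backend="torch", num_workers=8, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()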

It runs fine, but it doesn’t use all of the GPUs; it only uses one.
How can I fix this?

Hey @77loopin, thanks for posting this! I took a look into this and it seems like there’s indeed a bug in the script. The accelerator is reading from an environment variable that isn’t being set!

Adding the following line to the start of the training function fixes it:

os.environ["LOCAL_RANK"] = str(ray.train.local_rank())

I made a fix for the example here.
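
To spell out what’s going on: Ray Train launches one process per worker, but in this version it doesn’t export LOCAL_RANK, and HuggingFace Accelerate reads LOCAL_RANK to decide which GPU each process should use, so every worker ends up on the same device. Roughly, the fixed training function looks like this (a sketch assuming an Accelerate-based train_func like the example’s):

import os

import ray.train
from accelerate import Accelerator

def train_func(config):
    # Ray Train doesn't set LOCAL_RANK for its worker processes here, so
    # expose each worker's local rank before constructing the Accelerator.
    os.environ["LOCAL_RANK"] = str(ray.train.local_rank())

    # With LOCAL_RANK set, each worker's Accelerator binds to its own GPU.
    accelerator = Accelerator()
    print(f"worker {ray.train.world_rank()} -> device {accelerator.device}")
    # ... rest of the example's training loop (model, optimizer, epochs) ...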


Thanks @matthewdeng
It was resolved!!