Description
Hi, I want to use the Ray Train library to train a model on a remote cluster.
I first tried to run the training script through the Ray Jobs Python SDK, but it failed with an error saying the working directory could not be uploaded.
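For reference, my SDK attempt looked roughly like this (the address is a placeholder and the entrypoint is abbreviated):

from ray.job_submission import JobSubmissionClient

# Rough sketch of my SDK attempt; "<some-address>" stands in for the real
# dashboard address.
client = JobSubmissionClient("http://<some-address>")
job_id = client.submit_job(
    entrypoint="python train.py --model_name_or_path <model-path> ...",
    # Uploading the local working directory is the step that failed for me.
    runtime_env={"working_dir": "./"},
)
print(job_id)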
So I switched to the Jobs CLI. With the CLI, the job is submitted but then blocks and produces no further logs, even though the code is healthy and runs fine locally.
I use this command:
ray job submit --address="http://<some-address>" --runtime-env=./runtime_env.yaml -- python train.py \
--model_name_or_path "<model-path>" \
--train_file "<dataset-path>" \
--output_dir "result/test" \
--num_train_epochs 1 \
--save_steps 4906 \
--per_device_train_batch_size 2 \
--learning_rate 3e-5 \
--max_seq_length 32 \
--evaluation_strategy no \
--eval_steps 125 \
--pooler_type "cls" \
--mlp_only_train \
--temp 0.05 \
--do_train \
--do_eval \
--fp16
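The runtime_env.yaml itself is minimal; it looks roughly like this (the real file pins exact dependency versions):

# runtime_env.yaml (approximate)
working_dir: "."
pip:
  - torch
  - transformers
  - datasets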
The training code is:
import os

from ray.air import ScalingConfig
from ray.train.huggingface import TransformersTrainer

scaling_config = ScalingConfig(
    num_workers=3,
    use_gpu=True,
    trainer_resources={"CPU": 1, "GPU": 1},
)
trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=scaling_config,
    datasets={"train": ray_train_ds, "evaluation": None},
)
model_path = (
    model_args.model_name_or_path
    if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
    else None
)
# Note: Ray's Trainer.fit() takes no arguments, so model_path has to be
# consumed elsewhere (e.g. inside trainer_init_per_worker).
train_result = trainer.fit()
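For context, trainer_init_per_worker follows the standard Ray Train pattern of building a transformers.Trainer on each worker. A simplified sketch (the real function builds the SimCSE model and reads these values from the parsed command-line arguments):

import transformers

def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    # Hyperparameters are hard-coded here only for illustration.
    training_args = transformers.TrainingArguments(
        output_dir="result/test",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=3e-5,
        fp16=True,
    )
    model = transformers.AutoModel.from_pretrained("<model-path>")  # real code: SimCSE model
    return transformers.Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )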
And the logs are:
Job submission server address: <some-address>
2023-08-12 10:16:25,923 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_f3591dde7d58bb2d.zip.
2023-08-12 10:16:25,923 INFO packaging.py:518 -- Creating a file package for local directory './'.
2023-08-12 10:16:25,948 WARNING packaging.py:393 -- File /home/user1/Documents/code/zibert/simcse/simcse/zimodels/zibert_v2/pytorch_model.bin is very large (91.28MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/user1/Documents/code/zibert/simcse/simcse/zimodels/zibert_v2/pytorch_model.bin']})`
-------------------------------------------------------
Job 'raysubmit_VTrKtS4M41KZnMLY' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_VTrKtS4M41KZnMLY
Query the status of the job:
ray job status raysubmit_VTrKtS4M41KZnMLY
Request the job to be stopped:
ray job stop raysubmit_VTrKtS4M41KZnMLY
Tailing logs until the job exits (disable with --no-wait):
2023-08-11 23:46:50.363506: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-11 23:46:51.595338: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-08-11 23:46:51.595474: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-08-11 23:46:51.595484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
08/11/2023 23:46:54 - INFO - torch.distributed.nn.jit.instantiator - Created a temporary directory at /tmp/tmpg2wk1_1t
08/11/2023 23:46:54 - INFO - torch.distributed.nn.jit.instantiator - Writing /tmp/tmpg2wk1_1t/_remote_module_non_scriptable.py
How can I submit this script correctly, and why does the job block after submission?