Ray Train on a remote cluster


Hi, I want to use the Ray Train library to run training on a remote cluster.
I first tried to run the training script with the Ray Jobs Python SDK, but it reports that the working directory cannot be uploaded.
So I decided to use the Ray Jobs CLI instead. With the CLI, the job hangs and no further logs are produced, even though the same code runs fine locally.

I use this command:

ray job submit --address=http://"<some-address>" --runtime-env=./runtime_env.yaml -- python train.py \
    --model_name_or_path "<model-path>" \
    --train_file "<dataset-path>" \
    --output_dir "result/test" \
    --num_train_epochs 1 \
    --save_steps 4906 \
    --per_device_train_batch_size 2 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy no \
    --eval_steps 125 \
    --pooler_type "cls" \
    --mlp_only_train \
    --temp 0.05 \
    --do_train \
    --do_eval

The training code is:

      scaling_config = ScalingConfig(num_workers=3, use_gpu=True, trainer_resources={"CPU": 1, "GPU": 1})
      trainer = TransformersTrainer(
          datasets={"train": ray_train_ds, "evaluation": None},
          # ... (remaining arguments omitted from this excerpt)
      )
      # Resume from the given path only if it is a local checkpoint directory.
      model_path = (
          model_args.model_name_or_path
          if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
          else None
      )

      train_result = trainer.fit(model_path=model_path)
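
One thing that may be worth checking, although nothing in this thread confirms it as the cause of the hang, is how many resources the ScalingConfig above asks for in total. The sketch below is plain arithmetic, not a Ray API call. The helper name `total_requested` is hypothetical, and it assumes Ray Train's documented default of one CPU per worker plus one GPU per worker when use_gpu=True.

```python
def total_requested(num_workers, use_gpu, trainer_resources, resources_per_worker=None):
    """Sum the trainer's resources and every worker's resources.

    Assumption: when resources_per_worker is not given, each worker
    reserves 1 CPU, plus 1 GPU if use_gpu is True.
    """
    per_worker = resources_per_worker or {"CPU": 1, "GPU": 1 if use_gpu else 0}
    total = dict(trainer_resources)  # start from what the trainer itself reserves
    for name, amount in per_worker.items():
        total[name] = total.get(name, 0) + amount * num_workers
    return total


# Values taken from the ScalingConfig in the post.
requested = total_requested(
    num_workers=3,
    use_gpu=True,
    trainer_resources={"CPU": 1, "GPU": 1},
)
print(requested)  # {'CPU': 4, 'GPU': 4}
```

Under these assumptions the run needs 4 GPUs and 4 CPUs. Comparing that with the output of `ray.cluster_resources()` on the cluster would show whether the request can be satisfied; a run that asks for more than the cluster offers can sit waiting for resources without printing anything further.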

And the logs are:

Job submission server address: <some-address>
2023-08-12 10:16:25,923	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_f3591dde7d58bb2d.zip.
2023-08-12 10:16:25,923	INFO packaging.py:518 -- Creating a file package for local directory './'.
2023-08-12 10:16:25,948	WARNING packaging.py:393 -- File /home/user1/Documents/code/zibert/simcse/simcse/zimodels/zibert_v2/pytorch_model.bin is very large (91.28MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/user1/Documents/code/zibert/simcse/simcse/zimodels/zibert_v2/pytorch_model.bin']})`

Job 'raysubmit_VTrKtS4M41KZnMLY' submitted successfully

Next steps
  Query the logs of the job:
    ray job logs raysubmit_VTrKtS4M41KZnMLY
  Query the status of the job:
    ray job status raysubmit_VTrKtS4M41KZnMLY
  Request the job to be stopped:
    ray job stop raysubmit_VTrKtS4M41KZnMLY

Tailing logs until the job exits (disable with --no-wait):
2023-08-11 23:46:50.363506: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-11 23:46:51.595338: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-08-11 23:46:51.595474: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-08-11 23:46:51.595484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
08/11/2023 23:46:54 - INFO - torch.distributed.nn.jit.instantiator -   Created a temporary directory at /tmp/tmpg2wk1_1t
08/11/2023 23:46:54 - INFO - torch.distributed.nn.jit.instantiator -   Writing /tmp/tmpg2wk1_1t/_remote_module_non_scriptable.py

How can I submit this script?

Hey @ali_khoshtinattt, I’ve replied on the GitHub issue - let’s centralize the conversation there.
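
For readers who land here first, here is a minimal sketch of the same submission through the Ray Jobs Python SDK, with large files left out of the upload as the packaging warning in the logs suggests. The helper names, the `*.bin` pattern, and the shortened argument list are illustrative assumptions. It also assumes the model weights reach the cluster some other way, since an excluded file is not uploaded. It is not a confirmed fix for the upload error mentioned in the question.

```python
import shlex


def build_runtime_env(working_dir="./", excludes=("*.bin",)):
    """Return a runtime_env dict that uploads working_dir minus excluded files.

    The "*.bin" pattern is an illustrative assumption; patterns are
    interpreted relative to working_dir.
    """
    return {"working_dir": working_dir, "excludes": list(excludes)}


def build_entrypoint(script, args):
    """Quote a script and its arguments into a single shell command string."""
    return shlex.join(["python", script, *args])


def submit(address, entrypoint, runtime_env):
    """Submit the job and return its submission ID (requires Ray installed)."""
    # Imported lazily so the helpers above can be used without Ray.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(entrypoint=entrypoint, runtime_env=runtime_env)


runtime_env = build_runtime_env()
entrypoint = build_entrypoint(
    "train.py",
    ["--output_dir", "result/test", "--num_train_epochs", "1", "--do_train"],
)
print(entrypoint)
```

Calling `submit("http://<some-address>", entrypoint, runtime_env)` with the real dashboard address then plays the role of `ray job submit`, and the returned ID can be passed to `ray job logs` or `ray job status`.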