Description
Hi, I want to use the Ray Train library to train a model on a remote cluster.
I first tried to run the training script through the Ray Jobs Python SDK, but it failed with an error saying the working directory could not be uploaded.
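For reference, my SDK attempt looked roughly like this (the address is a placeholder and the entrypoint is abbreviated):

from ray.job_submission import JobSubmissionClient

# Rough sketch of my SDK attempt; "<some-address>" stands in for the real
# dashboard address.
client = JobSubmissionClient("http://<some-address>")
job_id = client.submit_job(
    entrypoint="python train.py --model_name_or_path <model-path> ...",
    # Uploading the local working directory is the step that failed for me.
    runtime_env={"working_dir": "./"},
)
print(job_id)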
So I switched to the Jobs CLI. With the CLI, the job is submitted but then blocks and produces no further logs, even though the code is healthy and runs fine locally.
I use this command:
ray job submit --address="http://<some-address>" --runtime-env=./runtime_env.yaml -- python train.py \
--model_name_or_path "<model-path>" \
--train_file "<dataset-path>" \
--output_dir "result/test" \
--num_train_epochs 1 \
--save_steps 4906 \
--per_device_train_batch_size 2 \
--learning_rate 3e-5 \
--max_seq_length 32 \
--evaluation_strategy no \
--eval_steps 125 \
--pooler_type "cls" \
--mlp_only_train \
--temp 0.05 \
--do_train \
--do_eval \
--fp16
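The runtime_env.yaml itself is minimal; it looks roughly like this (the real file pins exact dependency versions):

# runtime_env.yaml (approximate)
working_dir: "."
pip:
  - torch
  - transformers
  - datasets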
The training code is:
import os

from ray.air import ScalingConfig
from ray.train.huggingface import TransformersTrainer

scaling_config = ScalingConfig(
    num_workers=3,
    use_gpu=True,
    trainer_resources={"CPU": 1, "GPU": 1},
)
trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=scaling_config,
    datasets={"train": ray_train_ds, "evaluation": None},
)
model_path = (
    model_args.model_name_or_path
    if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
    else None
)
# Note: Ray's Trainer.fit() takes no arguments, so model_path has to be
# consumed elsewhere (e.g. inside trainer_init_per_worker).
train_result = trainer.fit()
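For context, trainer_init_per_worker follows the standard Ray Train pattern of building a transformers.Trainer on each worker. A simplified sketch (the real function builds the SimCSE model and reads these values from the parsed command-line arguments):

import transformers

def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    # Hyperparameters are hard-coded here only for illustration.
    training_args = transformers.TrainingArguments(
        output_dir="result/test",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=3e-5,
        fp16=True,
    )
    model = transformers.AutoModel.from_pretrained("<model-path>")  # real code: SimCSE model
    return transformers.Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )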
And the logs are:
Job submission server address: <some-address>
2023-08-12 10:16:25,923 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_f3591dde7d58bb2d.zip.
2023-08-12 10:16:25,923 INFO packaging.py:518 -- Creating a file package for local directory './'.
2023-08-12 10:16:25,948 WARNING packaging.py:393 -- File /home/user1/Documents/code/zibert/simcse/simcse/zimodels/zibert_v2/pytorch_model.bin is very large (91.28MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/user1/Documents/code/zibert/simcse/simcse/zimodels/zibert_v2/pytorch_model.bin']})`
-------------------------------------------------------
Job 'raysubmit_VTrKtS4M41KZnMLY' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_VTrKtS4M41KZnMLY
Query the status of the job:
ray job status raysubmit_VTrKtS4M41KZnMLY
Request the job to be stopped:
ray job stop raysubmit_VTrKtS4M41KZnMLY
Tailing logs until the job exits (disable with --no-wait):
2023-08-11 23:46:50.363506: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-11 23:46:51.595338: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-08-11 23:46:51.595474: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-08-11 23:46:51.595484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
08/11/2023 23:46:54 - INFO - torch.distributed.nn.jit.instantiator - Created a temporary directory at /tmp/tmpg2wk1_1t
08/11/2023 23:46:54 - INFO - torch.distributed.nn.jit.instantiator - Writing /tmp/tmpg2wk1_1t/_remote_module_non_scriptable.py
How can I submit this script correctly, and why does the job block after submission?