I have run into a problem with ray.tune. I am tuning on 2 nodes (one node with 1 GPU, another with 2 GPUs), and each trial requests 32 CPUs and 1 GPU. The problem is that ray.tune does not make full use of the GPU memory (each GPU uses only about 2.2 GB of 12 GB with batch_size=64), so the tuning procedure is very slow.
However, TorchTrainer does make full use of the GPU memory (each GPU uses about 10 GB with batch_size=64). I mention this because the train_func_per_worker I pass to tune.Tuner and to TorchTrainer is the same, except that ray.tune does not support train.torch.prepare_model and train.torch.prepare_data_loader, so I use an if...else... to skip them for the Tuner. So I'm confused. How should I deal with this?
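Roughly, the branch in my train_func_per_worker looks like this (a simplified sketch: build_model, build_dataloader, and run_one_epoch stand in for my actual code, and use_prepare is just the flag I flip between Train and Tune):

import ray.train.torch
from ray.air import session


def train_func_per_worker(config):
    args = config["args"]
    model = build_model(args)        # stands in for my model construction
    loader = build_dataloader(args)  # stands in for my DataLoader construction

    if config.get("use_prepare", False):
        # Only under TorchTrainer: wrap the model for DDP and let Ray move
        # batches to the right device.
        model = ray.train.torch.prepare_model(model)
        loader = ray.train.torch.prepare_data_loader(loader)

    for epoch in range(args.epochs):
        loss = run_one_epoch(model, loader, args)  # stands in for my training loop
        session.report({"loss": loss})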
The TorchTrainer:
import ray.air
import ray.tune
from ray.air import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_func_per_worker,
    train_loop_config={
        "args": args,
    },
    scaling_config=ScalingConfig(
        num_workers=args.ray_num_workers,  # The number of workers (Ray actors) to launch
        use_gpu=args.use_gpu,
    ),
    run_config=ray.air.RunConfig(
        progress_reporter=ray.tune.CLIReporter(max_report_frequency=600),
    ),
)
This will initialize the correct distributed backends and use resources as intended.
Let me know if this is what you are trying to do, or if you are trying to reuse the training function but for workflows that are different between Train and Tune.
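For completeness, running the trainer directly (outside of Tune) is then just:

# Launches the distributed run; each worker executes train_func_per_worker.
result = trainer.fit()
print(result.metrics)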
Thank you very much! I just want to use TorchTrainer and the Tuner at the same time; that way I can reuse almost all of the training function. Before your advice, I used a plain function as the trainable for the Tuner. Either TorchTrainer or a function trainable is fine for me, but I hit the same problem either way: the GPU memory is not fully used when tuning, while TorchTrainer alone is fine.
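For reference, this is roughly what I mean by using both at the same time, i.e. passing the TorchTrainer to the Tuner (a simplified sketch; the search-space ranges and the metric name are only examples, the parameter names match my run below):

from ray import tune
from ray.tune.schedulers import ASHAScheduler

tuner = tune.Tuner(
    trainer,  # the TorchTrainer built as above
    param_space={
        "train_loop_config": {
            "args": args,
            "loss_alpha": tune.uniform(0.1, 1.0),
            "loss_y_alpha": tune.uniform(1.0, 5.0),
            "lr": tune.loguniform(1e-5, 1e-2),
        },
    },
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        num_samples=16,
        scheduler=ASHAScheduler(),  # the AsyncHyperBand scheduler seen in the status output
    ),
)
results = tuner.fit()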
I tuned the TorchTrainer as you suggested (roughly as in the sketch above), but the GPU memory is still not fully used, so training is very time-consuming (with batch_size=64, each batch takes about 18 s under the Tuner, while with TorchTrainer alone it takes only about 0.6 s). Below is one node:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:18:00.0 On | Off |
| 41% 39C P8 18W / 140W | 3146MiB / 16376MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1393 G /usr/lib/xorg/Xorg 400MiB |
| 0 N/A N/A 2169 G /usr/bin/gnome-shell 203MiB |
| 0 N/A N/A 2709 G ...989766915259071852,131072 118MiB |
| 0 N/A N/A 3575 G ...RendererForSitePerProcess 141MiB |
| 0 N/A N/A 9476 G /proc/self/exe 29MiB |
| 0 N/A N/A 133582 C ...RayTrainWorker__execute() 2241MiB |
+-----------------------------------------------------------------------------+
When I use TorchTrainer directly, the node looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:18:00.0 On | Off |
| 49% 72C P2 121W / 140W | 10983MiB / 16376MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1393 G /usr/lib/xorg/Xorg 390MiB |
| 0 N/A N/A 2169 G /usr/bin/gnome-shell 93MiB |
| 0 N/A N/A 2709 G ...989766915259071852,131072 65MiB |
| 0 N/A N/A 3575 G ...RendererForSitePerProcess 119MiB |
| 0 N/A N/A 9476 G /proc/self/exe 26MiB |
| 0 N/A N/A 136777 C ...RayTrainWorker__execute() 10279MiB |
I notice another problem. The Ray cluster has 2 nodes with 3 GPUs in total, and I run 3 trials (each with 1 GPU) at the same time, but the Tune status shows "Number of trials: 3/16 (2 PENDING, 1 RUNNING)". Why are 2 trials PENDING? According to the logs, all 3 GPUs are in use and all 3 trials print output. Besides, each GPU worker needs 32 CPUs to prepare its dataloader, but the status output does not seem to report this correctly (see the resource sketch after the status output below).
(RayTrainWorker pid=105253, ip=10.20.84.14) Epoch1[ 35/161] Batch_Time 17.659 (18.278) Data_Load_Time 0.000 ( 0.665) Loss 1.2611e+05 (1.8929e+06)
(RayTrainWorker pid=105254, ip=10.20.84.14) Epoch1[ 35/161] Batch_Time 17.665 (18.337) Data_Load_Time 0.000 ( 0.641) Loss 1.2611e+05 (1.8929e+06)
(RayTrainWorker pid=143085) Epoch1[ 24/161] Batch_Time 25.957 (27.107) Data_Load_Time 0.000 ( 0.977) Loss 2.9558e+05 (2.6756e+06)
== Status ==
Current time: 2022-12-06 09:12:46 (running for 00:11:05.52)
Memory usage on this node: 27.5/62.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 12.000: None | Iter 3.000: None
Resources requested: 1.0/152 CPUs, 3.0/3 GPUs, 0.0/120.81 GiB heap, 0.0/54.21 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/zetlin/Code/Prediction_ML/saved_model/vectornet/VectorNet_Tune_2022-12-06-09-01-40
Number of trials: 3/16 (2 PENDING, 1 RUNNING)
+--------------------------+----------+--------------------+------------------------+------------------------+------------------------+
| Trial name | status | loc | train_loop_config/lo | train_loop_config/lo | train_loop_config/lr |
| | | | ss_alpha | ss_y_alpha | |
|--------------------------+----------+--------------------+------------------------+------------------------+------------------------|
| TorchTrainer_8b43a_00000 | RUNNING | 10.20.84.14:105110 | 0.308202 | 3.86022 | 0.00119693 |
| TorchTrainer_8b43a_00001 | PENDING | | 0.605744 | 4.78464 | 2.98891e-05 |
| TorchTrainer_8b43a_00002 | PENDING | | 0.267458 | 4.39912 | 1.24705e-05 |
+--------------------------+----------+--------------------+------------------------+------------------------+------------------------+
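For the resource question above: this is roughly how I understand the 32-CPUs-per-GPU-worker request would be expressed (a sketch; I am not sure whether this is what the "Resources requested" line is supposed to reflect):

from ray.air import ScalingConfig

# Each Ray Train worker gets 1 GPU plus 32 CPUs (used by the DataLoader workers).
scaling_config = ScalingConfig(
    num_workers=1,
    use_gpu=True,
    resources_per_worker={"CPU": 32, "GPU": 1},
)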