How to make full use of the GPU memory in Ray Tune

I ran into a problem with ray.tune. I am tuning on 2 nodes (one node with 1 GPU, the other with 2 GPUs), and each trial gets 32 CPUs and 1 GPU. The problem is that ray.tune does not make full use of the GPU memory (each GPU uses only about 2.2GB out of 12GB with batch_size=64), so the tuning procedure is very slow.
However, the TorchTrainer does make full use of the GPU memory (each GPU uses about 10GB with batch_size=64). I say that because the train_func_per_worker I pass to tune.Tuner and to TorchTrainer is the same, except that ray.tune does not support train.torch.prepare_model and train.torch.prepare_data_loader, so I use an if...else... to skip them for the Tuner (sketched below). I'm confused: how do I deal with this?
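For reference, the guard in my train_func_per_worker looks roughly like this (only a sketch; build_model, build_loader and the use_ray_train flag are my own placeholders, not Ray APIs):

from ray.train.torch import prepare_model, prepare_data_loader

def train_func_per_worker(config):
    args = config["args"]
    model, data_loader = build_model(args), build_loader(args)  # my own helpers

    if args.use_ray_train:
        # TorchTrainer path: let Ray Train wrap the model for DDP and handle device placement
        model = prepare_model(model)
        data_loader = prepare_data_loader(data_loader)
    else:
        # plain Tune functional-trainable path: skip the Ray Train helpers
        model = model.to("cuda")

    # ... training loop and metric reporting ...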

The TorchTrainer:

trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config={
            "args": args,
        },
        scaling_config=ScalingConfig(
            num_workers=args.ray_num_workers,  # The number of workers (Ray actors) to launch
            use_gpu=args.use_gpu,
        ),
        run_config=ray.air.RunConfig(
            progress_reporter=ray.tune.CLIReporter(max_report_frequency=600),
        ),
    )

The tune.Tuner:

tuner = tune.Tuner(
        tune.with_resources(
            train_func_per_worker,
            {"cpu": args.num_workers, "gpu": args.gpus_per_trial}
        ),
        tune_config=tune.TuneConfig(
            metric="ADE",
            mode="min",
            scheduler=scheduler,
            num_samples=num_samples,
            max_concurrent_trials=args.ray_num_workers
        ),
        run_config=ray.air.RunConfig(
            progress_reporter=tune.CLIReporter(max_report_frequency=600),
            checkpoint_config=ray.air.config.CheckpointConfig(
                num_to_keep=2, checkpoint_score_attribute="ADE", checkpoint_score_order="min"
            )
        ),
        param_space=config,
    )

In my case:

args.ray_num_workers = 3  # 3 GPUs, 3 processes
args.num_workers = 32     # each process uses 32 CPUs
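
To make the resource math explicit (a rough sketch; the 152-CPU / 3-GPU cluster totals come from the status output further below, and args.gpus_per_trial is 1 here):

# With tune.with_resources, every trial reserves args.num_workers CPUs and
# args.gpus_per_trial GPUs, so the number of trials Tune can run in parallel is
max_parallel = min(152 // args.num_workers, 3 // args.gpus_per_trial)  # min(4, 3) = 3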

If you want to tune your TorchTrainer, you should pass the trainer directly to tune.Tuner, like this:

trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config={
            "args": args,
        },
        scaling_config=ScalingConfig(
            num_workers=args.ray_num_workers,  # The number of workers (Ray actors) to launch
            use_gpu=args.use_gpu,
        ),
        run_config=ray.air.RunConfig(
            progress_reporter=ray.tune.CLIReporter(max_report_frequency=600),
        ),
    )
tuner = tune.Tuner(
        trainer,
        param_space={"train_loop_config": config},
        tune_config=tune.TuneConfig(
            metric="ADE",
            mode="min",
            scheduler=scheduler,
            num_samples=num_samples,
            max_concurrent_trials=args.ray_num_workers
        ),
        run_config=ray.air.RunConfig(
            progress_reporter=tune.CLIReporter(max_report_frequency=600),
            checkpoint_config=ray.air.config.CheckpointConfig(
                num_to_keep=2, checkpoint_score_attribute="ADE", checkpoint_score_order="min"
            )
        ),
    )

This will initialize the correct distributed backends and use resources as intended.

Let me know if this is what you are trying to do, or if you are trying to reuse the training function but for workflows that are different between Train and Tune.
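
One more note on the 32-CPUs-per-worker part: if you also want each Train worker to reserve CPUs (for example for dataloader processes), you can request them via resources_per_worker in the ScalingConfig. A sketch, assuming your 32-CPU-per-worker setup:

scaling_config = ScalingConfig(
    num_workers=args.ray_num_workers,
    use_gpu=args.use_gpu,
    # reserve CPUs for each Train worker as well, e.g. for dataloader workers
    resources_per_worker={"CPU": 32},
)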

Thank you very much! I just want to use TorchTrainer and the Tuner at the same time; that way I can reuse almost all of the training function. Before your advice, I was using a plain function as the trainable for the Tuner. Either TorchTrainer or a functional trainable is fine for me, but I hit the same problem: the GPU memory is not fully used during tuning, while TorchTrainer on its own is fine.

I tuned the TorchTrainer as you suggested, but the GPU memory is still not fully used, so training is very time-consuming (batch_size=64, each batch takes about 18s, while with TorchTrainer alone it only takes about 0.6s). Below is one node:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:18:00.0  On |                  Off |
| 41%   39C    P8    18W / 140W |   3146MiB / 16376MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1393      G   /usr/lib/xorg/Xorg                400MiB |
|    0   N/A  N/A      2169      G   /usr/bin/gnome-shell              203MiB |
|    0   N/A  N/A      2709      G   ...989766915259071852,131072      118MiB |
|    0   N/A  N/A      3575      G   ...RendererForSitePerProcess      141MiB |
|    0   N/A  N/A      9476      G   /proc/self/exe                     29MiB |
|    0   N/A  N/A    133582      C   ...RayTrainWorker__execute()     2241MiB |
+-----------------------------------------------------------------------------+

When I use TorchTrainer directly, the node looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:18:00.0  On |                  Off |
| 49%   72C    P2   121W / 140W |  10983MiB / 16376MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1393      G   /usr/lib/xorg/Xorg                390MiB |
|    0   N/A  N/A      2169      G   /usr/bin/gnome-shell               93MiB |
|    0   N/A  N/A      2709      G   ...989766915259071852,131072       65MiB |
|    0   N/A  N/A      3575      G   ...RendererForSitePerProcess      119MiB |
|    0   N/A  N/A      9476      G   /proc/self/exe                     26MiB |
|    0   N/A  N/A    136777      C   ...RayTrainWorker__execute()    10279MiB |

I notice another problem. The Ray cluster has 2 nodes with 3 GPUs in total, and I run 3 trials at the same time (each with 1 GPU), but the Ray status shows: “Number of trials: 3/16 (2 PENDING, 1 RUNNING)”. Why are 2 trials PENDING? According to the logs, all 3 GPUs are used and all 3 trials print output. Besides, each GPU needs 32 CPUs to prepare the dataloader, but the log does not show that information correctly.

(RayTrainWorker pid=105253, ip=10.20.84.14) Epoch1[ 35/161]	Batch_Time 17.659 (18.278)	Data_Load_Time  0.000 ( 0.665)	Loss 1.2611e+05 (1.8929e+06)
(RayTrainWorker pid=105254, ip=10.20.84.14) Epoch1[ 35/161]	Batch_Time 17.665 (18.337)	Data_Load_Time  0.000 ( 0.641)	Loss 1.2611e+05 (1.8929e+06)
(RayTrainWorker pid=143085) Epoch1[ 24/161]	Batch_Time 25.957 (27.107)	Data_Load_Time  0.000 ( 0.977)	Loss 2.9558e+05 (2.6756e+06)
== Status ==
Current time: 2022-12-06 09:12:46 (running for 00:11:05.52)
Memory usage on this node: 27.5/62.5 GiB 
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 12.000: None | Iter 3.000: None
Resources requested: 1.0/152 CPUs, 3.0/3 GPUs, 0.0/120.81 GiB heap, 0.0/54.21 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/zetlin/Code/Prediction_ML/saved_model/vectornet/VectorNet_Tune_2022-12-06-09-01-40
Number of trials: 3/16 (2 PENDING, 1 RUNNING)
+--------------------------+----------+--------------------+------------------------+------------------------+------------------------+
| Trial name               | status   | loc                |   train_loop_config/lo |   train_loop_config/lo |   train_loop_config/lr |
|                          |          |                    |               ss_alpha |             ss_y_alpha |                        |
|--------------------------+----------+--------------------+------------------------+------------------------+------------------------|
| TorchTrainer_8b43a_00000 | RUNNING  | 10.20.84.14:105110 |               0.308202 |                3.86022 |            0.00119693  |
| TorchTrainer_8b43a_00001 | PENDING  |                    |               0.605744 |                4.78464 |            2.98891e-05 |
| TorchTrainer_8b43a_00002 | PENDING  |                    |               0.267458 |                4.39912 |            1.24705e-05 |
+--------------------------+----------+--------------------+------------------------+------------------------+------------------------+

This is my fault: the Ray status only shows the resources requested by the Tuner itself.
The “2 PENDING” is because I configured num_workers=3 for the TorchTrainer.
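
In other words: every TorchTrainer trial launches num_workers Train workers, and with use_gpu=True each worker reserves one GPU, so a single trial with num_workers=3 takes all 3 GPUs and the other trials have to wait. If I wanted 3 single-GPU trials running in parallel instead, the ScalingConfig would look roughly like this (a sketch based on my setup):

scaling_config = ScalingConfig(
    num_workers=1,  # one Train worker per trial -> one GPU per trial
    use_gpu=True,
)
# 3 GPUs in the cluster / 1 GPU per trial = up to 3 trials RUNNING at once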

You have helped me solve my problem. Please ignore my previous reply, that was my own mistake.

ray.tune works smoothly with TorchTrainer. That's great!
