How to use fractional GPUs in `ray.tune.Tuner`?

I use ray.tune.Tuner where the trainable is a TorchTrainer (the basic code structure is below). I have 1 GPU and want to run 2 tune trials on that GPU at the same time, so each trial should only get 0.5 GPU. The only way I found in the docs to set a fractional GPU is tune.with_resources, but that doesn't work when the trainable is a TorchTrainer.

tuner = tune.Tuner(
    trainer,  # TorchTrainer
    tune_config=tune.TuneConfig(
        metric="best_fde",
        mode="min",
        scheduler=scheduler,
        num_samples=num_samples,
        reuse_actors=False,
    ),
    run_config=ray.air.RunConfig(
        name=tuner_dir_name,  # dir name
        progress_reporter=tune.CLIReporter(max_report_frequency=600),
    ),
    param_space={
        "train_loop_config": config,
        # "scaling_config": ray.air.config.ScalingConfig(
        #     num_workers=2,
        #     resources_per_worker={
        #         "CPU": 4,
        #         "GPU": 0.5,
        #     },
        # ),
    },
)
results = tuner.fit()

I also tried setting a fractional GPU in the TorchTrainer, which also didn't work as expected.

trainer = TorchTrainer(
    train_loop_per_worker=train_func_per_worker,
    train_loop_config={
        "args": args,
    },
    scaling_config=ScalingConfig(
        num_workers=2,  # the number of workers (Ray actors) to launch
        use_gpu=args.use_gpu,
        resources_per_worker={"GPU": 0.5},
    ),
    run_config=ray.air.RunConfig(
        progress_reporter=ray.tune.CLIReporter(max_report_frequency=600),
    ),
)

So I don't know how to appropriately set the resource parameters. Please help me.

Hey! What is the total amount of resources you want to schedule for a single TorchTrainer?

Based on your current ScalingConfig, each Trainer/Trial will request a total of

trainer_resources + num_workers * resources_per_worker

where

trainer_resources=1 CPU
num_workers=2
resources_per_worker=0.5 GPU

So in this case, each trial will request a total of 1 CPU and 1 GPU. Does that match what you're seeing in the console output?
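For reference, if the goal is for each trial to take only half a GPU in total (so that two trials fit on a single GPU), the ScalingConfig would need to request 0.5 GPU overall per trial. A minimal sketch with a single worker per trial, where the CPU count is just a placeholder:

from ray.air.config import ScalingConfig

# With 1 worker at 0.5 GPU, each trial requests
# 1 CPU (trainer) + 4 CPUs + 0.5 GPU (worker),
# so two such trials can be packed onto one GPU.
scaling_config = ScalingConfig(
    num_workers=1,                                 # one training worker per trial
    use_gpu=True,
    resources_per_worker={"CPU": 4, "GPU": 0.5},   # fractional GPU per worker
)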

This leads to another error in train.torch.prepare_model, as follows:

2023-08-21 09:01:33,024	ERROR trial_runner.py:993 -- Trial TorchTrainer_32216_00001: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1641548, ip=192.168.1.128, repr=TorchTrainer)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=1641581, ip=192.168.1.128, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f5cc4b4ee20>)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "multipathpp_train.py", line 442, in train_func_per_worker
    model = train.torch.prepare_model(model,parallel_strategy_kwargs={"find_unused_parameters":True})
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 120, in prepare_model
    return get_accelerator(_TorchAccelerator).prepare_model(
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 365, in prepare_model
    model = DataParallel(model, **parallel_strategy_kwargs)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Result for TorchTrainer_32216_00001:
  date: 2023-08-21_09-01-20
  experiment_id: 010ad717f5004de69f94685a292af065
  hostname: xxx
  node_ip: 192.168.1.128
  pid: 1641548
  timestamp: 1692579680
  trial_id: '32216_00001'

But I don’t know how to solve it.

environment:

ubuntu 20.04
pytorch: 1.10.2+cu113
ray: 2.1.0

Hmm I think this is because you end up scheduling both workers on the same GPU. Is there a reason you are trying to run DistributedDataParallel on fractional GPUs here?

The situation is this: the model is sensitive to the batch_size, so I want to use a small batch_size, which means I can run more trials on the same GPU at the same time.

So for this situation, should I not use train.torch.prepare_model? However, if I don't use train.torch.prepare_model, it is as if one trial is running 2 times at the same time.

Besides, if I want to run 4 trials on 2 GPUs at the same time, what should I do?

If you want to run 4 trials on a total of 2 GPUs, all at the same time, you should do:

tune.Tuner(
    tune.with_resources(train_func, resources={"gpu": 0.5}),
    tune_config={..},
    ...
)

That is, without TorchTrainer. Tuner can be used by itself for models that don't need distributed training.
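A rough sketch of what that could look like; train_func, the epoch loop, and the resource numbers are placeholders based on your snippets, and session.report is the AIR API for reporting the metric:

from ray import tune
from ray.air import session

def train_func(config):
    # Plain (non-distributed) training loop; no train.torch.prepare_model needed.
    for epoch in range(config["epochs"]):
        best_fde = 1.0 / (epoch + 1)            # placeholder; your training step goes here
        session.report({"best_fde": best_fde})  # report the metric Tune is optimizing

tuner = tune.Tuner(
    # 0.5 GPU per trial -> up to 4 concurrent trials on 2 GPUs
    tune.with_resources(train_func, resources={"cpu": 4, "gpu": 0.5}),
    tune_config=tune.TuneConfig(metric="best_fde", mode="min", num_samples=4),
    param_space={"epochs": 10},
)
results = tuner.fit()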

Thank you @rliaw

I found another solution that keeps TorchTrainer: setting the torch backend to 'gloo', which lets several trials run on one GPU at the same time. However, it costs more time: when I run 4 trials on 2 GPUs, every batch takes 2 s for each trial, whereas a batch takes only 0.35 s when running 1 trial on each GPU.
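In case it is useful for others, the backend can be passed via TorchConfig, roughly like this sketch (train_func_per_worker and the scaling numbers are the same as in my earlier snippet):

from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig

trainer = TorchTrainer(
    train_loop_per_worker=train_func_per_worker,  # same training function as above
    torch_config=TorchConfig(backend="gloo"),     # use gloo instead of NCCL
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"GPU": 0.5},
    ),
)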

Since my code uses TorchTrainer inside the Tuner, I can reuse most of the code for both plain TorchTrainer training and Tuner training. If I switch to tune.with_resources, would the timing issue be solved?

That is to say, is there no way to run 4 trials on a total of 2 GPUs efficiently with TorchTrainer?