Using fractional GPUs with TorchTrainer and the Tuner API

Hi, I have a job using the TorchTrainer API that looks something like this:

trainer = TorchTrainer(...)
tuner = Tuner(trainable=trainer, ...)
tuner.fit()

Everything works as expected, including training with a single worker on a GPU. However, I am working with only a single GPU, and the models are small enough that several could fit on it at once. I'd like to know how to let the tuner use fractional GPUs so that I can run multiple trials concurrently.

The docs here seem to suggest wrapping the Trainer with tune.with_resources, but that doesn't work, because Trainer doesn't inherit from Trainable.
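In other words, roughly this (a sketch of my reading of the docs; trainer is the TorchTrainer from my snippet above):

from ray import tune
from ray.tune import Tuner

# what I understood the docs to suggest: wrap the trainable with a resource request
wrapped = tune.with_resources(trainer, {"gpu": 0.5})
tuner = Tuner(trainable=wrapped)  # this is where it breaks, since TorchTrainer isn't a Trainable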

What’s the correct way to specify fractional GPU usage with the Tuner API and TorchTrainer (or Trainer more generally)?

You can specify resource requirements for a Trainer using a ScalingConfig: Configurations User Guide — Ray 2.1.0

In your case, you’d do:
TorchTrainer(..., scaling_config=ScalingConfig(resources_per_worker={"GPU": 0.5})).
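Putting that together with your snippet, a minimal sketch for Ray 2.1 might look like this (the training function and search space here are placeholders, not your actual code):

from ray import tune
from ray.air import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import Tuner

def train_loop_per_worker(config):
    ...  # your existing per-worker training function

trainer = TorchTrainer(
    train_loop_per_worker,
    # one worker per trial, each reserving half a GPU,
    # so two trials can run concurrently on a single GPU
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"GPU": 0.5},
    ),
)

tuner = Tuner(
    trainable=trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
)
tuner.fit()

With 0.5 GPU per worker, Tune can schedule two trials at a time on the single GPU, assuming enough CPUs are free for the trainer actors.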

Just wanted to say: thank you! This worked :slight_smile:

@cupe Hi, did you use ray.train.torch.prepare_model in your TorchTrainer? If not, can fractional GPUs be used with multiple GPUs?

I ask because my code shows the following error:

2023-08-21 09:01:33,024	ERROR trial_runner.py:993 -- Trial TorchTrainer_32216_00001: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1641548, ip=192.168.1.128, repr=TorchTrainer)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=1641581, ip=192.168.1.128, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f5cc4b4ee20>)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "multipathpp_train.py", line 442, in train_func_per_worker
    model = train.torch.prepare_model(model,parallel_strategy_kwargs={"find_unused_parameters":True})
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 120, in prepare_model
    return get_accelerator(_TorchAccelerator).prepare_model(
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 365, in prepare_model
    model = DataParallel(model, **parallel_strategy_kwargs)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Result for TorchTrainer_32216_00001:
  date: 2023-08-21_09-01-20
  experiment_id: 010ad717f5004de69f94685a292af065
  hostname: xxx
  node_ip: 192.168.1.128
  pid: 1641548
  timestamp: 1692579680
  trial_id: '32216_00001'
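
For reference, line 442 of my script is just the usual prepare_model wrapping inside the per-worker training function; a simplified sketch (the model here is only a stand-in for mine):

import torch.nn as nn
import ray.train.torch

def train_func_per_worker(config):
    model = nn.Linear(8, 1)  # stand-in for my actual model
    # line 442 in my script: wrap the model for DDP; this is the call that raises the NCCL error
    model = ray.train.torch.prepare_model(
        model,
        parallel_strategy_kwargs={"find_unused_parameters": True},
    )
    # ... rest of the training loop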