Everything is working as expected, including training with a single worker on a GPU. However, I have only a single GPU, and the models are small enough that several could fit on it at once. I'd like to know how to let the tuner use fractional GPUs so that I can run multiple trials concurrently.
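For context, my setup looks roughly like this (a minimal sketch; `train_func_per_worker` matches my real entry point, but the hyperparameters shown are stand-ins):

```python
from ray import tune
from ray.air.config import ScalingConfig  # ray.train.ScalingConfig on newer releases
from ray.train.torch import TorchTrainer


def train_func_per_worker(config):
    # Builds the model, wraps it with train.torch.prepare_model(...),
    # and runs the training loop.
    ...


trainer = TorchTrainer(
    train_loop_per_worker=train_func_per_worker,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-2)}},
    tune_config=tune.TuneConfig(num_samples=4),
)
results = tuner.fit()
```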
The docs here seem to suggest wrapping the trainable in tune.with_resources, but that doesn't work for a Trainer, since Trainer doesn't inherit from Trainable.
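Concretely, this is the kind of thing that gets rejected (sketch):

```python
# Fails: tune.with_resources expects a function or a Trainable class,
# not a Trainer instance.
tuner = tune.Tuner(
    tune.with_resources(trainer, {"gpu": 0.5}),
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-2)}},
)
```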
What’s the correct way to specify fractional GPU usage with the Tuner API and TorchTrainer (or Trainer more generally)?
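For reference, the configuration I would expect to express this is something along these lines (a sketch only; I'm assuming resources_per_worker on the ScalingConfig is the right knob):

```python
scaling_config = ScalingConfig(
    num_workers=1,
    use_gpu=True,
    # Request half a GPU per training worker so two trials can share one device.
    resources_per_worker={"GPU": 0.5},
)
```

Below is the full error output from one of the failing trials: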
2023-08-21 09:01:33,024 ERROR trial_runner.py:993 -- Trial TorchTrainer_32216_00001: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1641548, ip=192.168.1.128, repr=TorchTrainer)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=1641581, ip=192.168.1.128, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f5cc4b4ee20>)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "multipathpp_train.py", line 442, in train_func_per_worker
    model = train.torch.prepare_model(model,parallel_strategy_kwargs={"find_unused_parameters":True})
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 120, in prepare_model
    return get_accelerator(_TorchAccelerator).prepare_model(
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 365, in prepare_model
    model = DataParallel(model, **parallel_strategy_kwargs)
  File "/home/xxx/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Result for TorchTrainer_32216_00001:
  date: 2023-08-21_09-01-20
  experiment_id: 010ad717f5004de69f94685a292af065
  hostname: xxx
  node_ip: 192.168.1.128
  pid: 1641548
  timestamp: 1692579680
  trial_id: '32216_00001'