@cupe Hi, did you use ray.train.torch.prepare_model
in the TorchTrainer? If not, can fractional GPUs still be used across multiple GPUs?
I'm asking because my code raises the following error:
2023-08-21 09:01:33,024 ERROR trial_runner.py:993 -- Trial TorchTrainer_32216_00001: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1641548, ip=192.168.1.128, repr=TorchTrainer)
File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 355, in train
raise skipped from exception_cause(skipped)
File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=1641581, ip=192.168.1.128, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f5cc4b4ee20>)
File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
raise skipped from exception_cause(skipped)
File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "multipathpp_train.py", line 442, in train_func_per_worker
model = train.torch.prepare_model(model,parallel_strategy_kwargs={"find_unused_parameters":True})
File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 120, in prepare_model
return get_accelerator(_TorchAccelerator).prepare_model(
File "/home/xxx/miniconda3/lib/python3.8/site-packages/ray/train/torch/train_loop_utils.py", line 365, in prepare_model
model = DataParallel(model, **parallel_strategy_kwargs)
File "/home/xxx/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Result for TorchTrainer_32216_00001:
date: 2023-08-21_09-01-20
experiment_id: 010ad717f5004de69f94685a292af065
hostname: xxx
node_ip: 192.168.1.128
pid: 1641548
timestamp: 1692579680
trial_id: '32216_00001'
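
For reference, here is a minimal sketch of how I'm setting things up (the model, worker count, and fractional-GPU values are placeholders, not my exact config; the prepare_model call matches line 442 of multipathpp_train.py in the traceback above):

import torch.nn as nn
import ray.train.torch
from ray.train.torch import TorchTrainer
from ray.air import ScalingConfig  # import location may vary slightly across Ray versions


def train_func_per_worker(config):
    model = nn.Linear(10, 1)  # placeholder for my actual model
    # Wrap the model for distributed training; parallel_strategy_kwargs is
    # forwarded to DistributedDataParallel.
    model = ray.train.torch.prepare_model(
        model, parallel_strategy_kwargs={"find_unused_parameters": True}
    )
    # ... training loop ...


trainer = TorchTrainer(
    train_loop_per_worker=train_func_per_worker,
    # Fractional GPUs: each worker requests half a GPU, so two workers
    # can share one physical GPU (values here are illustrative).
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"GPU": 0.5},
    ),
)
result = trainer.fit()

The error is thrown while prepare_model wraps the model in DDP, which is why I'm wondering whether fractional GPUs are compatible with that call.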