GPU Scaling configuration for TensorFlow/Horovod/PyTorch

Hi Team,

For the scaling configuration, I would like to understand the best practices for maximizing the utilization of GPU resources.

For my use case, I have 4 nodes with 6 GPUs each (24 GPUs in total).

To run a Ray Train job using HorovodTrainer or TensorflowTrainer, I could specify num_workers=24 and resources_per_worker={"GPU": 1}. But for TorchTrainer, I can only specify num_workers=4 and resources_per_worker={"GPU": 6}. In both cases, I noticed all 24 GPUs were utilized.
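For reference, here is roughly what the two scaling configurations look like. This is just a sketch: I'm only showing the scaling config and assuming the Ray 2.x import path.

from ray.air.config import ScalingConfig

# Horovod/TensorFlow case described above: one worker per GPU
# (24 workers x 1 GPU across 4 nodes x 6 GPUs).
horovod_tf_scaling = ScalingConfig(
    num_workers=24,
    use_gpu=True,
    resources_per_worker={"GPU": 1},
)

# Torch case described above: one worker per node (4 workers x 6 GPUs).
torch_scaling = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    resources_per_worker={"GPU": 6},
)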

To run a Ray Tune job with 2 trials executed in parallel using HorovodTrainer or TensorflowTrainer, I could specify num_workers=12 and resources_per_worker={"GPU": 1}. But for TorchTrainer, I can only specify num_workers=2 and resources_per_worker={"GPU": 6}. In both cases, I noticed all 24 GPUs were utilized.
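And a rough sketch of the Tune case (train_loop_per_worker is a placeholder here, and I'm using TuneConfig's max_concurrent_trials to cap concurrency at 2):

from ray import tune
from ray.air.config import ScalingConfig
from ray.train.horovod import HorovodTrainer

def train_loop_per_worker(config):
    ...  # placeholder for the actual Horovod training loop

trainer = HorovodTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=12,                  # per trial; 2 trials use all 24 GPUs
        use_gpu=True,
        resources_per_worker={"GPU": 1},
    ),
)

tuner = tune.Tuner(
    trainable=trainer,
    tune_config=tune.TuneConfig(num_samples=2, max_concurrent_trials=2),
)
results = tuner.fit()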

I'm wondering why the behavior is different when all three trainers extend DataParallelTrainer. From my experiments, I noticed that TorchTrainer cannot have num_workers greater than the number of nodes in the cluster. Is this behavior expected? If so, what are the underlying guidelines for configuring the resources in the best way?

Any inputs are highly appreciated.

Thank you for your time!

Regards,
Vivek

Not that I am aware of.

Curious what kind of errors you run into when you specify num_workers > 4 in the TorchTrainer case. Can you paste the output?

Sure @xwjiang2010

I did a quick experiment with 2 nodes with 4 GPUs each. I configured num_workers=3 and resources_per_worker={"GPU": 1} and ran hyperparameter optimization using Ray Tune.

from ray import tune
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig

trainer = TorchTrainer(
    train_loop_per_worker=ray_job_args["train_loop_per_worker"],
    train_loop_config=ray_job_args["train_loop_config"],
    scaling_config=ScalingConfig(
        num_workers=3,                      # 3 workers x 1 GPU each per trial
        use_gpu=True,
        resources_per_worker={"GPU": 1},
    ),
    torch_config=TorchConfig(timeout_s=120),
)

tuner = tune.Tuner(
    trainable=trainer,
    tune_config=ray_job_args["tune_config"],
    run_config=ray_job_args["run_config"],
    param_space=ray_job_args["param_space"],
)

It throws the error below:

File "/tmp/ray/session_2023-04-10_19-26-51_844083_8/runtime_resources/working_dir_files/_ray_pkg_b1fb384cc51654bd/train/mnist_torch.py", line 94, in train_mnist
    model.to(device)
  File "/home/jobuser/build/drex-starter-kit/environments/satellites/python/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/home/jobuser/build/drex-starter-kit/environments/satellites/python/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/jobuser/build/drex-starter-kit/environments/satellites/python/lib/python3.10/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/jobuser/build/drex-starter-kit/environments/satellites/python/lib/python3.10/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Please let me know if you need any additional information.

Thanks again for your time.

I have never seen this error. I am wondering if it's Ray related, or if your setup may have some issue with PyTorch DDP. I have seen a bunch of “all CUDA-capable devices are busy or unavailable” errors on the PyTorch forum, usually related to torch/CUDA versions.

(Search: all cuda-capable devices are busy or unavailable site:discuss.pytorch.org)
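If it helps, one quick sanity check (a rough sketch, not something I've run against your cluster) is to launch one small Ray task per GPU and confirm torch can actually allocate on its assigned device:

import ray
import torch

ray.init()  # or ray.init(address="auto") when attaching to an existing cluster

@ray.remote(num_gpus=1)
def check_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES for this task, so cuda:0 is the assigned GPU.
    x = torch.ones(1, device="cuda:0")  # fails if the device is busy/unavailable
    return torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0)

# One task per GPU on a single 4-GPU node.
print(ray.get([check_gpu.remote() for _ in range(4)]))

If these tasks hit the same error, the problem is more likely in the CUDA driver / torch install than in the Trainer's scaling configuration.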
