[Tune] Ray Tune hangs on multi-GPU, multi-node runs


I’m trying to deploy a Ray Tune run across a SLURM cluster. It runs successfully as long as there is only 1 GPU per worker, but when my scaling config specifies multiple GPUs per worker across multiple nodes, it hangs.

scaling_config = ScalingConfig(
    num_workers=2,  # number of nodes?
    use_gpu=use_gpu,
    resources_per_worker={"CPU": 2, "GPU": gpus_per_trial},
)

With this config, I expected 2 workers, each with 2 GPUs, to run my training script. Instead it hangs, seemingly during DDP initialization.
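As a side note, one common cause of a hang like this is that the cluster can't satisfy the aggregate resource request, so the trial's placement group sits pending forever. A quick sanity check (pure Python, independent of Ray; the numbers mirror the scaling config above) is to compare the total per-trial request against what the SLURM allocation actually exposes:

```python
def total_request(num_workers, resources_per_worker):
    """Sum the resources Ray must reserve for one trial's worker group."""
    return {k: v * num_workers for k, v in resources_per_worker.items()}

# Mirrors the config above: 2 workers, each with 2 CPUs and 2 GPUs.
request = total_request(2, {"CPU": 2, "GPU": 2})
print(request)  # {'CPU': 4, 'GPU': 4}

# Hypothetical cluster totals; if the SLURM allocation exposes fewer
# GPUs to Ray than the request, the trial stays PENDING, which can
# look exactly like a hang.
cluster = {"CPU": 16, "GPU": 4}
assert all(cluster.get(k, 0) >= v for k, v in request.items())
```

In my case the resources reported by `ray status` do cover the request, which is why I suspect the hang happens later, inside DDP setup.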
I also tried using Ray-Lightning, but that didn’t help either. I’ve seen similar posts, but the suggestions there didn’t resolve it.
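To narrow down where the DDP rendezvous gets stuck, I’ve been re-running with verbose distributed logging enabled. This is just a sketch of the environment setup (`NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are standard NCCL/PyTorch knobs; `train_tune.py` is a placeholder for my entry script):

```shell
# Make NCCL and torch.distributed print where each rank is stuck
# during collective setup, instead of hanging silently.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# python train_tune.py   # hypothetical entry point for the Tune run
```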

Any advice is welcome.

Thanks! :slight_smile:

Can you share what the rest of your script looks like? Are you able to tell through the Ray Dashboard where it is hanging?

Hi @matthewdeng ,

I’ve added the code to reproduce the issue here: [Tune] Multi GPU, Multi Node hyperparameter search not functioning · Issue #38505 · ray-project/ray · GitHub
Any suggestions would be really helpful. I’m quite blocked.

Warm regards :slight_smile: