[Tune] Ray Tune hangs on multi-GPU, multi-node runs


I’m trying to deploy a Ray Tune run across a SLURM cluster. It runs successfully as long as there is only 1 GPU per worker, but when my scaling config specifies multiple GPUs per worker across multiple nodes, it hangs.

scaling_config = ScalingConfig(
    num_workers=2,  # number of nodes?
    use_gpu=use_gpu,
    resources_per_worker={"CPU": 2, "GPU": gpus_per_trial},
)

With this config, I expected 2 workers, each with 2 GPUs, to run my training script. Instead it hangs, seemingly during DDP initialization.
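As a side note, one common cause of a hang like this is that the cluster can't satisfy the aggregate resource request, so the trial's placement group sits pending forever. A quick sanity check (pure Python, independent of Ray; the numbers mirror the scaling config above) is to compare the total per-trial request against what the SLURM allocation actually exposes:

```python
def total_request(num_workers, resources_per_worker):
    """Sum the resources Ray must reserve for one trial's worker group."""
    return {k: v * num_workers for k, v in resources_per_worker.items()}

# Mirrors the config above: 2 workers, each with 2 CPUs and 2 GPUs.
request = total_request(2, {"CPU": 2, "GPU": 2})
print(request)  # {'CPU': 4, 'GPU': 4}

# Hypothetical cluster totals; if the SLURM allocation exposes fewer
# GPUs to Ray than the request, the trial stays PENDING, which can
# look exactly like a hang.
cluster = {"CPU": 16, "GPU": 4}
assert all(cluster.get(k, 0) >= v for k, v in request.items())
```

In my case the resources reported by `ray status` do cover the request, which is why I suspect the hang happens later, inside DDP setup.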
I also tried using Ray-Lightning, but that didn’t help either. I’ve seen similar posts, but the suggestions there didn’t resolve it.
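To narrow down where the DDP rendezvous gets stuck, I’ve been re-running with verbose distributed logging enabled. This is just a sketch of the environment setup (`NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are standard NCCL/PyTorch knobs; `train_tune.py` is a placeholder for my entry script):

```shell
# Make NCCL and torch.distributed print where each rank is stuck
# during collective setup, instead of hanging silently.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# python train_tune.py   # hypothetical entry point for the Tune run
```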

Any advice is welcome.

Thanks! :slight_smile:

Can you share what the rest of your script looks like? Are you able to tell through the Ray Dashboard where it is hanging?

Hi @matthewdeng ,

I’ve added the code to reproduce the issue here: [Tune] Multi GPU, Multi Node hyperparameter search not functioning · Issue #38505 · ray-project/ray · GitHub
Any suggestions would be really helpful. I’m quite blocked.

Warm regards :slight_smile: