Hi,
I’m trying to deploy a Ray Tune run across a SLURM cluster. It runs successfully as long as there is only 1 GPU per worker, but as soon as the scaling config asks for multiple GPUs per worker across multiple nodes, it hangs.
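For context, the driver side of my script looks roughly like this (assuming the Ray cluster has already been brought up on the allocated nodes, e.g. with the sbatch + `ray start` recipe from the Ray SLURM docs, which I’ve omitted here):

```python
import ray

# Attach to the existing SLURM-launched cluster instead of starting a local one.
ray.init(address="auto")

# Sanity check: this should list the CPUs/GPUs of every node in the allocation.
print(ray.cluster_resources())
```

The scaling config is: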
```python
from ray.train import ScalingConfig  # ray.air.config.ScalingConfig on older Ray versions

use_gpu = True
gpus_per_trial = 2

scaling_config = ScalingConfig(
    num_workers=2,  # number of training workers -- is this the number of nodes?
    use_gpu=use_gpu,
    resources_per_worker={"CPU": 2, "GPU": gpus_per_trial},
)
```
With this config, I expected 2 workers, each with 2 GPUs, to run my training script, but it seems to hang while DDP is being set up.
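In case it’s useful, here is a stripped-down sketch of how the trainer is wired into Tune. The real training loop is replaced by a placeholder that just reports a dummy metric, the search space is illustrative, and the imports assume a fairly recent Ray where `ScalingConfig` and `train.report` live under `ray.train`:

```python
from ray import train, tune
from ray.train.torch import TorchTrainer


def train_func(config):
    # Placeholder for the real DDP training loop. The actual script wraps the
    # model with ray.train.torch.prepare_model() and the dataloaders with
    # prepare_data_loader(), then trains as usual.
    for epoch in range(2):
        train.report({"loss": 1.0 / (epoch + 1)})


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,  # the ScalingConfig defined above
)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
    tune_config=tune.TuneConfig(num_samples=1),
)
results = tuner.fit()
```

The single-GPU-per-worker version of this same setup runs to completion; only the multi-GPU-per-worker case hangs.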
I also tried using Ray-Lightning, but that didn’t help either. I’ve seen similar posts about this, but the suggestions there didn’t work for me.
Any advice is welcome.
Thanks!