How to launch a multi-node job with Ray Train?


I understand that the Ray Train Trainer lets you launch a PyTorch single-host, multi-GPU job like this:

from ray.train import Trainer

trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
results = trainer.run(train_func)  # train_func is the per-worker training function
trainer.shutdown()

How do I launch a PyTorch multi-host, multi-GPU job with Ray Train? Shall we use MPI or something similar so that the trainer is launched on multiple nodes at the same time?

Hi @Lacruche,

This can be achieved by setting up your Ray cluster with multiple nodes. Running the same script while connected to this cluster will then distribute the Ray Train workers across the hosts (and use DDP). With Ray Train, the same script that runs on one 4-GPU node can run on a cluster of two 2-GPU nodes or a cluster of four 1-GPU nodes!

In general, MPI is not supported. By default, if you instantiate your Trainer with use_gpu=True, then the workers will use NCCL as the communication backend. This can be overridden to use Gloo by instantiating the Trainer with a TorchConfig that has backend="gloo".

I’m using SageMaker-managed EC2, so I get a cluster of N EC2 instances already networked together. How can I create a Ray cluster on an already-running EC2 cluster? The Ray AWS docs seem to start from EC2 instance provisioning, which I don’t need.

Would choosing a host as a coordinating server and running the below command (from here) turn my N EC2 instances into a Ray cluster?
python --ips <list_of_node_ips> --port <PORT>

Ah, I have yet to try that myself, but it seems reasonable. @Ameer_Haj_Ali can you confirm whether this is a viable approach for creating a Ray cluster on an existing EC2 cluster?