How to launch a multi-node job with Ray Train?

Hi,

I understand that the Ray Train trainer allows you to launch a PyTorch single-host, multi-GPU job via this command:

from ray.train import Trainer

trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
results = trainer.run(train_func_distributed)
trainer.shutdown()

How do I launch a PyTorch multi-host, multi-GPU job with Ray Train? Should we use MPI or something similar so that the trainer is launched on multiple nodes at the same time?

Hi @Lacruche,

This can be achieved by setting up your Ray Cluster with multiple nodes. Running the same script while connected to this cluster will then distribute the Ray Train workers across the hosts (and use DDP). With Ray Train, the same script that runs on 1 4-GPU node can be run on a cluster with 2 2-GPU nodes or a cluster with 4 1-GPU nodes!
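To make this concrete, here is a minimal sketch of what the script-side connection looks like (this assumes Ray is already running on the cluster nodes; `train_func_distributed` is the user-defined training function from the question above):

```python
import ray
from ray.train import Trainer

# Connect to the existing multi-node Ray cluster instead of
# starting a fresh local single-node Ray instance.
ray.init(address="auto")

# Identical to the single-node script: Ray schedules the 4
# workers across however many nodes the cluster has.
trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
trainer.start()
results = trainer.run(train_func_distributed)
trainer.shutdown()
```

The only difference from the single-node case is the `ray.init(address="auto")` call; the trainer code itself is unchanged.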

In general, MPI is not supported. By default, if you instantiate your Trainer with use_gpu=True, then the workers will use NCCL as the communication backend. This can be overridden to use Gloo by instantiating the Trainer with a TorchConfig that has backend="gloo".
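As a sketch, overriding the backend looks like this (assuming the legacy `ray.train.Trainer` API shown above, with `TorchConfig` imported from `ray.train.torch`):

```python
from ray.train import Trainer
from ray.train.torch import TorchConfig

# Use Gloo instead of the default NCCL for inter-worker
# communication, e.g. for CPU-only training.
trainer = Trainer(
    backend=TorchConfig(backend="gloo"),
    num_workers=4,
)
```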

I’m using SageMaker-managed EC2, so I get a cluster of N EC2 instances already networked together. How can I create a Ray cluster on these already-running instances? The Ray AWS docs seem to start from EC2 instance provisioning, which I don’t need.

Would choosing a host as a coordinating server and running the command below (from here) turn my N EC2 instances into a Ray cluster?
python coordinator_server.py --ips <list_of_node_ips> --port <PORT>

Ah I have yet to try that myself, but that seems reasonable. @Ameer_Haj_Ali can you confirm if using coordinator_server.py is a viable approach for creating a Ray cluster on an existing EC2 cluster?
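In the meantime, a minimal manual alternative (a sketch, assuming Ray is installed on every host and the relevant ports are open between them) is to start Ray yourself on each machine:

```shell
# On the host you choose as the head node:
ray start --head --port=6379

# On each of the other N-1 hosts, pointing at the head node's IP:
ray start --address='<head_node_ip>:6379'
```

After that, any script that calls ray.init(address="auto") on the head node will run against the full cluster.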

Ray Train goes silent when I run multi-node, multi-GPU training. For example, I am using 2 nodes with 2 GPUs each (4 GPUs total), and it hangs silently after this:

(BaseWorkerMixin pid=20102, ip=10.0.100.15) 2022-03-11 01:47:32,207 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25342, ip=10.0.100.14) 2022-03-11 01:47:32,327 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25343, ip=10.0.100.14) 2022-03-11 01:47:32,369 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25344, ip=10.0.100.14) 2022-03-11 01:47:32,457 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25345, ip=10.0.100.14) 2022-03-11 01:47:32,455 INFO torch.py:247 – Wrapping provided model in DDP.

That is the last log output I see.

Running “py-spy dump --native --pid 25344” gives this result:
Process 25344: ray::BaseWorkerMixin
Python v3.8.12 (/home/gpu4/anaconda3/bin/python3.8)

Thread 25344 (idle): “MainThread”
epoll_wait (libc-2.27.so)
boost::asio::detail::epoll_reactor::run (ray/_raylet.so)
boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
boost::asio::detail::scheduler::run (ray/_raylet.so)
boost::asio::io_context::run (ray/_raylet.so)
ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcess::RunTaskExecutionLoop (ray/_raylet.so)
run_task_loop (ray/_raylet.so)
main_loop (ray/worker.py:430)
(ray/workers/default_worker.py:224)
Thread 25458 (idle): “ray_import_thread”
do_futex_wait (libpthread-2.27.so)
__new_sem_wait_slow (libpthread-2.27.so)
PyThread_acquire_lock_timed (python3.8)
lock_PyThread_acquire_lock (python3.8)
wait (threading.py:306)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:148)
result (grpc/_channel.py:733)
_poll_locked (ray/_private/gcs_pubsub.py:270)
poll (ray/_private/gcs_pubsub.py:409)
_run (ray/_private/import_thread.py:71)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
clone (libc-2.27.so)
Thread 2165 (idle): “Thread-41”
epoll_wait (libc-2.27.so)
0x7f6069f29bba (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069fd142c (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069cc5575 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069d28e47 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069d5a505 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
channel_spin (grpc/_channel.py:1258)
0x7f6069c8caf9 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
clone (libc-2.27.so)
Please tell me how I can fix this so that training runs normally. Thank you.

Can you create a new topic and post a reproduction script?

Yes, I can, but I have already filed a bug as a GitHub issue; you can check there and discuss it with more people. Thank you.