How to launch a multi-node job with Ray Train?

Hi,

I understand that the Ray Train trainer allows you to launch a PyTorch single-host, multi-GPU job via this command:

from ray.train import Trainer

trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
results = trainer.run(train_func_distributed)
trainer.shutdown()

How do I launch a PyTorch multi-host, multi-GPU job with Ray Train? Should we use MPI or something similar so that the trainer is launched on multiple nodes at the same time?

Hi @Lacruche,

This can be achieved by setting up your Ray Cluster with multiple nodes. Running the same script while connected to this cluster will then distribute the Ray Train workers across the hosts (and use DDP). With Ray Train, the same script that runs on 1 4-GPU node can be run on a cluster with 2 2-GPU nodes or a cluster with 4 1-GPU nodes!
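To make this concrete, here is a minimal sketch of what the script-side connection looks like (this assumes Ray is already running on the cluster nodes; `train_func_distributed` is the user-defined training function from the question above):

```python
import ray
from ray.train import Trainer

# Connect to the existing multi-node Ray cluster instead of
# starting a fresh local single-node Ray instance.
ray.init(address="auto")

# Identical to the single-node script: Ray schedules the 4
# workers across however many nodes the cluster has.
trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)
trainer.start()
results = trainer.run(train_func_distributed)
trainer.shutdown()
```

The only difference from the single-node case is the `ray.init(address="auto")` call; the trainer code itself is unchanged.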

In general, MPI is not supported. By default, if you instantiate your Trainer with use_gpu=True, then the workers will use NCCL as the communication backend. This can be overridden to use Gloo by instantiating the Trainer with a TorchConfig that has backend="gloo".
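As a sketch, overriding the backend looks like this (assuming the legacy `ray.train.Trainer` API shown above, with `TorchConfig` imported from `ray.train.torch`):

```python
from ray.train import Trainer
from ray.train.torch import TorchConfig

# Use Gloo instead of the default NCCL for inter-worker
# communication, e.g. for CPU-only training.
trainer = Trainer(
    backend=TorchConfig(backend="gloo"),
    num_workers=4,
)
```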

I’m using SageMaker-managed EC2, so I get a cluster of N EC2 instances already networked together. How can I create a Ray cluster on these already-running instances? The Ray AWS docs seem to start from EC2 instance provisioning, which I don’t need.

Would choosing a host as a coordinating server and running the command below (from here) turn my N EC2 instances into a Ray cluster?
python coordinator_server.py --ips <list_of_node_ips> --port <PORT>

Ah I have yet to try that myself, but that seems reasonable. @Ameer_Haj_Ali can you confirm if using coordinator_server.py is a viable approach for creating a Ray cluster on an existing EC2 cluster?
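In the meantime, a minimal manual alternative (a sketch, assuming Ray is installed on every host and the relevant ports are open between them) is to start Ray yourself on each machine:

```shell
# On the host you choose as the head node:
ray start --head --port=6379

# On each of the other N-1 hosts, pointing at the head node's IP:
ray start --address='<head_node_ip>:6379'
```

After that, any script that calls ray.init(address="auto") on the head node will run against the full cluster.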

Ray Train goes silent when I run multi-node, multi-GPU training. For example, I am using 2 nodes with 2 GPUs each (4 GPUs total), and it hangs silently after this:

(BaseWorkerMixin pid=20102, ip=10.0.100.15) 2022-03-11 01:47:32,207 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25342, ip=10.0.100.14) 2022-03-11 01:47:32,327 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25343, ip=10.0.100.14) 2022-03-11 01:47:32,369 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25344, ip=10.0.100.14) 2022-03-11 01:47:32,457 INFO torch.py:247 – Wrapping provided model in DDP.
(BaseWorkerMixin pid=25345, ip=10.0.100.14) 2022-03-11 01:47:32,455 INFO torch.py:247 – Wrapping provided model in DDP.

That is the last log output I see.

Running “py-spy dump --native --pid 25344” gives this result:
Process 25344: ray::BaseWorkerMixin
Python v3.8.12 (/home/gpu4/anaconda3/bin/python3.8)

Thread 25344 (idle): “MainThread”
epoll_wait (libc-2.27.so)
boost::asio::detail::epoll_reactor::run (ray/_raylet.so)
boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
boost::asio::detail::scheduler::run (ray/_raylet.so)
boost::asio::io_context::run (ray/_raylet.so)
ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcess::RunTaskExecutionLoop (ray/_raylet.so)
run_task_loop (ray/_raylet.so)
main_loop (ray/worker.py:430)
(ray/workers/default_worker.py:224)
Thread 25458 (idle): “ray_import_thread”
do_futex_wait (libpthread-2.27.so)
__new_sem_wait_slow (libpthread-2.27.so)
PyThread_acquire_lock_timed (python3.8)
lock_PyThread_acquire_lock (python3.8)
wait (threading.py:306)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:148)
result (grpc/_channel.py:733)
_poll_locked (ray/_private/gcs_pubsub.py:270)
poll (ray/_private/gcs_pubsub.py:409)
_run (ray/_private/import_thread.py:71)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
clone (libc-2.27.so)
Thread 2165 (idle): “Thread-41”
epoll_wait (libc-2.27.so)
0x7f6069f29bba (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069fd142c (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069cc5575 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069d28e47 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
0x7f6069d5a505 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
channel_spin (grpc/_channel.py:1258)
0x7f6069c8caf9 (grpc/_cython/cygrpc.cpython-38-x86_64-linux-gnu.so)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
clone (libc-2.27.so)
Please tell me how I can fix this so that training runs normally. Thank you.

Can you create a new topic and post a reproduction script?

Yes, I can, but I have already filed a bug as a GitHub issue; you can check there and discuss it with more people. Thank you.