Hey guys,
Now, I know RaySGD is a wrapper built on top of PyTorch. Quote: "The TorchTrainer is a wrapper around torch.distributed.launch with a Python API" (ray/raysgd_pytorch.rst at 9a93dd9682a216d2028db8edb60ff1485f653721 · ray-project/ray · GitHub)
I was trying to fully understand how RaySGD does the wrapping. I dug a bit into the code (the trainer and worker_group parts) but couldn't tell for sure. torch.distributed.launch is a script and doesn't provide a Python API. Does that mean RaySGD still calls torch.distributed.launch via a shell script under the hood, just with all the hard work taken care of?
In short, I'm curious where the entry point is; a pointer to the entry point in the code would be much appreciated.
@HuangLED ah yeah, the description is a little misleading. RaySGD doesn't actually call torch.distributed.launch. It runs the various training processes as Ray actors. The entry point is the TorchTrainer API.
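For reference, here's a rough sketch of what using that entry point looks like, based on the creator-function style constructor from older RaySGD releases. Argument names vary across Ray versions, so treat this as illustrative rather than exact:

```python
# Illustrative sketch of the TorchTrainer entry point (creator-function style
# from older RaySGD docs; newer versions use a TrainingOperator instead).
import torch
import torch.nn as nn
import ray
from ray.util.sgd import TorchTrainer

def model_creator(config):
    return nn.Linear(1, 1)

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))

def data_creator(config):
    # Tiny synthetic dataset purely for illustration.
    x = torch.randn(64, 1)
    y = 2 * x
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x, y), batch_size=16)
    return loader, loader  # train loader, validation loader

ray.init()
trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=2)  # each worker runs as a Ray actor; no torch.distributed.launch
print(trainer.train())
trainer.shutdown()
```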
I guess there must be some sort of coordination service that controls all the parallel training processes.
AFAIK, the PyTorch distributed module uses either etcd or the built-in c10d backend. Does RaySGD rely on either of these, or does Ray use its own service to coordinate all the parallel workers?
So there are two levels here. Ray is used to launch the processes (as Ray actors), and it handles things like placement, proper resource allocation, and scheduling. Fault-tolerant and elastic training are implemented at this level.
RaySGD still uses torch.distributed under the hood to actually handle the communication among the training processes. A process group is created and each training worker is initialized with this process group.
So the torch.distributed module is still being used, but the torch.distributed.launch utility script is now being replaced by Ray.
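To make that concrete, here's a conceptual sketch (not RaySGD's actual code) of what replacing torch.distributed.launch with Ray actors amounts to: Ray schedules the worker processes, and each worker still joins a torch.distributed process group for communication. The address/port handling here is simplified placeholders for illustration:

```python
# Conceptual sketch only: Ray actors stand in for the processes that
# torch.distributed.launch would normally spawn; torch.distributed still
# handles the inter-worker communication.
import ray
import torch.distributed as dist

@ray.remote
class TrainingWorker:
    def setup(self, rank, world_size, master_addr, master_port):
        # Each actor joins the same process group, which is what
        # torch.distributed.launch would otherwise configure via env vars.
        dist.init_process_group(
            backend="gloo",
            init_method=f"tcp://{master_addr}:{master_port}",
            rank=rank,
            world_size=world_size)
        return rank

    def train_step(self):
        # Real training code would run forward/backward here and rely on
        # torch.distributed collectives (e.g. allreduce) for gradient sync.
        return dist.get_rank()

ray.init()
world_size = 2
workers = [TrainingWorker.remote() for _ in range(world_size)]
# The address/port are placeholders; RaySGD picks these internally.
ray.get([w.setup.remote(i, world_size, "127.0.0.1", 29500)
         for i, w in enumerate(workers)])
print(ray.get([w.train_step.remote() for w in workers]))
```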