Hey guys,
Now, I know RaySGD is a wrapper built on top of PyTorch. Quote: "The TorchTrainer is a wrapper around torch.distributed.launch with a Python API" (ray/raysgd_pytorch.rst at 9a93dd9682a216d2028db8edb60ff1485f653721 · ray-project/ray · GitHub)
I was trying to fully understand how RaySGD does the wrapping. I dug a bit into the code (the trainer and worker_group parts) but couldn't tell for sure. torch.distributed.launch is a script and doesn't provide a Python API. Does that mean RaySGD still calls torch.distributed.launch via a shell script under the hood, just with all the hard work taken care of?
In short, I'm curious where the entry point is; a pointer to the entry point in the code would be much appreciated.
@HuangLED ah yeah, the description is a little misleading. RaySGD doesn't actually call torch.distributed.launch. It runs the various training processes as Ray actors. The entry point is the TorchTrainer API.
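For reference, here's a rough sketch of what using that entry point looks like, based on the creator-function style constructor from older RaySGD releases. Argument names vary across Ray versions, so treat this as illustrative rather than exact:

```python
# Illustrative sketch of the TorchTrainer entry point (creator-function style
# from older RaySGD docs; newer versions use a TrainingOperator instead).
import torch
import torch.nn as nn
import ray
from ray.util.sgd import TorchTrainer

def model_creator(config):
    return nn.Linear(1, 1)

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))

def data_creator(config):
    # Tiny synthetic dataset purely for illustration.
    x = torch.randn(64, 1)
    y = 2 * x
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x, y), batch_size=16)
    return loader, loader  # train loader, validation loader

ray.init()
trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=2)  # each worker runs as a Ray actor; no torch.distributed.launch
print(trainer.train())
trainer.shutdown()
```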
I guess there must be some sort of coordination service that controls all the parallel training processes.
AFAIK, the PyTorch distributed module uses either etcd or the built-in c10d backend. Does RaySGD rely on either of these, or does Ray use its own service to coordinate all the parallel workers?
So there are two levels here. Ray is used to launch the processes (as Ray actors), and it handles things like placement, proper resource allocation, and scheduling. Fault-tolerant and elastic training are implemented at this level.
RaySGD still uses torch.distributed under the hood to actually handle the communication among the training processes. A process group is created and each training worker is initialized with this process group.
So the torch.distributed module is still being used, but the torch.distributed.launch utility script is now being replaced by Ray.
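To make that concrete, here's a conceptual sketch (not RaySGD's actual code) of what replacing torch.distributed.launch with Ray actors amounts to: Ray schedules the worker processes, and each worker still joins a torch.distributed process group for communication. The address/port handling here is simplified placeholders for illustration:

```python
# Conceptual sketch only: Ray actors stand in for the processes that
# torch.distributed.launch would normally spawn; torch.distributed still
# handles the inter-worker communication.
import ray
import torch.distributed as dist

@ray.remote
class TrainingWorker:
    def setup(self, rank, world_size, master_addr, master_port):
        # Each actor joins the same process group, which is what
        # torch.distributed.launch would otherwise configure via env vars.
        dist.init_process_group(
            backend="gloo",
            init_method=f"tcp://{master_addr}:{master_port}",
            rank=rank,
            world_size=world_size)
        return rank

    def train_step(self):
        # Real training code would run forward/backward here and rely on
        # torch.distributed collectives (e.g. allreduce) for gradient sync.
        return dist.get_rank()

ray.init()
world_size = 2
workers = [TrainingWorker.remote() for _ in range(world_size)]
# The address/port are placeholders; RaySGD picks these internally.
ray.get([w.setup.remote(i, world_size, "127.0.0.1", 29500)
         for i, w in enumerate(workers)])
print(ray.get([w.train_step.remote() for w in workers]))
```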