[RaySGD] Communication Backend in RaySGD

Hi all! We are using RaySGD + PyTorch. We constructed our code according to the given examples.

This is how we construct the trainer:

# assuming the Ray Train (RaySGD v2) Trainer API
from ray.train import Trainer
from ray.train.callbacks import JsonLoggerCallback

trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu)
trainer.start()
print("trainer start")
result = trainer.run(
    train_func=train_func,
    config={
        "lr": args.lr,
        "batch_size": args.batch_size,
        "epochs": args.epochs,
        "pid": pid,
    },
    callbacks=[JsonLoggerCallback()])

The train_func is the same training function we would use with native torch DDP, which means there is no explicit call to any Ray function inside it. A sketch of what we mean is shown below.
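For context, here is a minimal sketch (not our actual code) of such a train_func: plain torch DDP code with no explicit Ray calls. The model and the random batches are placeholders, and it assumes the Trainer's "torch" backend has already initialized the torch.distributed process group before this function runs.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def train_func(config):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(10, 1).to(device)      # placeholder model
    model = DistributedDataParallel(model)   # gradient all-reduce is handled by torch.distributed
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for _ in range(config["epochs"]):
        # placeholder batch; a real run would iterate a DataLoader with a DistributedSampler
        x = torch.randn(config["batch_size"], 10, device=device)
        y = torch.randn(config["batch_size"], 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                      # gradients are all-reduced here via nccl/gloo
        optimizer.step()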

So, we’re quite interested in what backend is used for data (gradient) communication. Is it nccl or Ray that performs the all-reduce of the data? As far as we understand, it is still nccl that is responsible for the gradient exchange. Is our understanding correct? If so, will the Ray system optimize the nccl communication? For example, is there some mechanism in Ray that implicitly modifies the communication rules or adds synchronization in the nccl communication backend?

Hey @daxixi! Yes, your understanding is correct.

Currently the communication for gradient synchronization all happens out of band, and Ray is not involved in it. So for torch, this would be torch.distributed (either nccl or gloo).
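If you want to confirm this yourself, a quick sanity check (just a sketch, using only standard torch.distributed calls) at the top of your train_func would be:

import torch.distributed as dist

def train_func(config):
    # The Trainer's "torch" backend initializes the process group before this runs,
    # so we can simply inspect which backend torch is using for the all-reduce.
    if dist.is_initialized():
        print("torch.distributed backend:", dist.get_backend(),
              "| rank", dist.get_rank(), "of", dist.get_world_size())
    # ... rest of the usual DDP training code ...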

I’m curious though, what type of optimizations are you looking for here/what Ray could provide on top of nccl?

Hi @amogkam! Thanks for your reply. I’m wondering about the difference between RaySGD and native DDP. Just regarding the latency of training one batch or one epoch, will RaySGD and native DDP show a performance difference? For example, will RaySGD run faster than native DDP, or add overhead, or does this depend only on native DDP itself?

The reason I ask is that the in-process store and the distributed object store are parts of Ray I like very much; they help manage data transfer efficiently. I was guessing that maybe Ray uses the object store to manage gradients among the different DDP workers. So my question is really whether there is any interaction between Ray and the CUDA/kernel functions originally called by the torch code when doing DDP.