[RaySGD] Communication Backend in RaySGD

Hi all! We are using RaySGD + PyTorch. We constructed our code according to the given examples.

This is how we construct the trainer:

# assuming the Ray Train (RaySGD v2) Trainer API
from ray.train import Trainer
from ray.train.callbacks import JsonLoggerCallback

trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu)
trainer.start()
print("trainer start")
result = trainer.run(
    train_func=train_func,
    config={
        "lr": args.lr,
        "batch_size": args.batch_size,
        "epochs": args.epochs,
        "pid": pid,
    },
    callbacks=[JsonLoggerCallback()])

The train_func is the same training function we would use with native torch DDP, which means there is no explicit call to any Ray function inside it. A sketch of what we mean is shown below.
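For context, here is a minimal sketch (not our actual code) of such a train_func: plain torch DDP code with no explicit Ray calls. The model and the random batches are placeholders, and it assumes the Trainer's "torch" backend has already initialized the torch.distributed process group before this function runs.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def train_func(config):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(10, 1).to(device)      # placeholder model
    model = DistributedDataParallel(model)   # gradient all-reduce is handled by torch.distributed
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for _ in range(config["epochs"]):
        # placeholder batch; a real run would iterate a DataLoader with a DistributedSampler
        x = torch.randn(config["batch_size"], 10, device=device)
        y = torch.randn(config["batch_size"], 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                      # gradients are all-reduced here via nccl/gloo
        optimizer.step()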

So, we’re quite interested in what backend is used for data (gradient) communication. Is it nccl or Ray that performs the all-reduce of the data? As far as we understand, it is still nccl that is responsible for the gradient exchange. Is our understanding correct? If so, will the Ray system optimize the nccl communication? For example, is there some mechanism in Ray that implicitly modifies the communication rules or adds synchronization in the nccl communication backend?

Hey @daxixi! Yes, your understanding is correct.

Currently the communication for gradient synchronization all happens out of band, and Ray is not involved in it. So for torch, this would be torch.distributed (either nccl or gloo).
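If you want to confirm this yourself, a quick sanity check (just a sketch, using only standard torch.distributed calls) at the top of your train_func would be:

import torch.distributed as dist

def train_func(config):
    # The Trainer's "torch" backend initializes the process group before this runs,
    # so we can simply inspect which backend torch is using for the all-reduce.
    if dist.is_initialized():
        print("torch.distributed backend:", dist.get_backend(),
              "| rank", dist.get_rank(), "of", dist.get_world_size())
    # ... rest of the usual DDP training code ...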

I’m curious though, what type of optimizations are you looking for here/what Ray could provide on top of nccl?

Hi @amogkam! Thanks for your reply. I’m wondering about the difference between RaySGD and native DDP. Just regarding the latency of training one batch or one epoch, will RaySGD and native DDP show a performance difference? For example, will RaySGD run faster than native DDP, or add overhead, or does this depend only on native DDP itself?

The reason I ask is that the in-process store and the distributed object store are parts of Ray I like very much; they help manage data transfer efficiently. I was guessing that maybe Ray uses the object store to manage gradients among the different DDP workers. So my question is really whether there is any interaction between Ray and the CUDA/kernel functions originally called by the torch code when doing DDP.