Hi all! We are using RaySGD + PyTorch, and we structured our code according to the given examples.
This is how we construct the trainer:
# Trainer and JsonLoggerCallback come from Ray Train (imports omitted in our original post)
from ray.train import Trainer
from ray.train.callbacks import JsonLoggerCallback

trainer = Trainer(
    backend="torch", num_workers=num_workers, use_gpu=use_gpu)
trainer.start()
print("trainer start")
result = trainer.run(
    train_func=train_func,
    config={
        "lr": args.lr,
        "batch_size": args.batch_size,
        "epochs": args.epochs,
        "pid": pid,
    },
    callbacks=[JsonLoggerCallback()])
In train_func, the training logic is the same as what we would write with native torch DDP, which means there is no explicit call to any Ray function; a rough sketch is below.
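To make that concrete, here is a minimal sketch of the kind of train_func we mean. The model, data, and loss below are only placeholders (not our real code); the point is that it is plain PyTorch DDP code with no Ray API calls, assuming the process group has already been initialized for each worker before train_func runs:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_func(config):
    # Placeholder model, data, and loss; our real code uses its own model and DataLoader.
    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{torch.cuda.current_device()}") if use_cuda else torch.device("cpu")

    model = nn.Linear(10, 1).to(device)
    # Plain DDP wrapping -- no Ray-specific calls; we assume the process group
    # is already set up for this worker.
    model = DDP(model, device_ids=[device] if use_cuda else None)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for _ in range(config["epochs"]):
        # Placeholder batch; in our real code this comes from a DataLoader.
        x = torch.randn(config["batch_size"], 10, device=device)
        y = torch.randn(config["batch_size"], 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradient all-reduce across workers happens here
        optimizer.step()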
So we are quite interested in which backend handles the data (gradient) communication: is it NCCL or Ray that performs the all-reduce? Our understanding is that it is still NCCL that is responsible for the gradient exchange. Is that correct? If so, does Ray optimize the NCCL communication in any way, i.e., is there some mechanism in Ray that implicitly modifies the communication pattern or adds synchronization on top of the NCCL backend?
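For context, our reading of the Ray Train docs is that the process-group backend can also be requested explicitly via a TorchConfig instead of the plain "torch" string; please correct us if the class or argument names below are wrong, this is just how we understand the API:

from ray.train import Trainer
from ray.train.torch import TorchConfig

# Explicitly ask for NCCL as the torch.distributed process-group backend
# rather than letting it be inferred from use_gpu.
trainer = Trainer(
    backend=TorchConfig(backend="nccl"),
    num_workers=num_workers,
    use_gpu=use_gpu)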