[SGD] [Tune] How does the performance of RaySGD compare with PyTorch DDP?

Yeah, I think the main problem is that an optimizer created outside of the training operator will have state buffers that live on the CPU rather than the GPU.
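A minimal sketch of the pitfall being described, using plain PyTorch (the device-check at the end is just for illustration): if the model is moved to the GPU *before* the optimizer is constructed, and the optimizer is built where the training step runs, its state buffers (e.g. SGD momentum) are created on the same device as the parameters.

```python
import torch
import torch.nn as nn

# Fall back to CPU so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(4, 2).to(device)

# Building the optimizer *after* the model is on its target device means
# any state tensors it lazily creates will land on that device too.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step populates the momentum buffers.
x = torch.randn(8, 4, device=device)
model(x).sum().backward()
optimizer.step()

# Collect the devices of all optimizer state tensors.
buffer_devices = {
    buf.device.type
    for state in optimizer.state.values()
    for buf in state.values()
    if torch.is_tensor(buf)
}
print(buffer_devices)
```

If instead the optimizer had been constructed (and stepped) on the CPU before the model moved to the GPU, those buffers would stay CPU-resident, forcing transfers on every step.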

A comment: shouldn't we expect Ray SGD to have lower performance than torch.distributed, since Ray incurs overhead (scheduling, updating system state, etc.)?

Is there any tool to measure the various metrics, such as time spent on data transfer?