Yeah, I think the main problem is that the optimizer you create outside of the operator will have buffers that live on the CPU (rather than the GPU).
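For context, here is a minimal plain-PyTorch sketch of how that happens (this is not Ray SGD's internals, and it assumes a CUDA device is available): Adam allocates its state buffers lazily on whatever device the parameters are on at the first `step()`, so an optimizer that is created and stepped before the model is moved to the GPU is left with CPU-resident state. The `move_optimizer_state` helper at the end is hypothetical, just to show one possible workaround.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # parameters start out on the CPU
optimizer = torch.optim.Adam(model.parameters())

# One step on CPU: Adam lazily allocates its exp_avg / exp_avg_sq state
# buffers on the same device as the parameters, i.e. the CPU.
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()

model.cuda()                                  # parameters move to the GPU ...
state = optimizer.state[next(model.parameters())]
print(state["exp_avg"].device)                # ... but the optimizer state stays on CPU

# Hypothetical workaround (not a Ray SGD API): move the optimizer state
# onto the same device as the parameters it tracks.
def move_optimizer_state(optimizer, device):
    for param_state in optimizer.state.values():
        for key, value in param_state.items():
            if torch.is_tensor(value):
                param_state[key] = value.to(device)

move_optimizer_state(optimizer, "cuda")
```

Creating the optimizer inside the operator, after the model has been moved to the GPU, avoids the mismatch in the first place.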
Shouldn’t we expect Ray SGD to have lower performance than torch.distributed, as Ray incurs overhead (scheduling, updating system state, etc.)?
Is there any tool to measure the various metrics, such as the time spent on data transfer?
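For what it's worth, here is a minimal sketch of one way to do this with the standard PyTorch profiler (not a Ray-specific tool): the profiler records the CUDA memcpy kernels (e.g. "Memcpy HtoD") alongside compute kernels, so host-to-device transfer time shows up in the summary table.

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024)                   # tensor on the host

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x.cuda()                              # host-to-device transfer we want to time
    z = y @ y                                 # some GPU compute for comparison
    torch.cuda.synchronize()

# Memcpy entries appear in the table next to the matmul kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```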