Hi,
I have a cluster set up, and a parallel iterator or training job runs on Ray. What are the different ways to verify that the work is actually distributed in parallel across the nodes?
I know about the Ray dashboard and Ray logging. What is the best way to see this happening inside Ray, and to show it as evidence to a client or customer?
Thanks
Sumanth
The Ray dashboard would be the best way to confirm parallel training. You can also use standard tools on each node to check whether multiple GPUs or CPUs are being utilized, e.g. nvidia-smi for GPUs or htop for CPUs.
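If you want something more concrete to hand to a customer than dashboard screenshots, you can also demonstrate it programmatically. Here is a minimal sketch (assuming it is run from a machine already attached to the cluster; the task count of 200 is just illustrative): launch a batch of small tasks, record which node each one actually executed on, and cross-check against `ray.nodes()`.

```python
import socket
from collections import Counter

import ray

ray.init(address="auto")  # connect to the existing cluster instead of starting a local one

@ray.remote
def which_node():
    # Each task reports the IP of the node it actually executed on.
    return socket.gethostbyname(socket.gethostname())

# Fire off a batch of tasks; with enough of them, Ray spreads them across the cluster.
ip_counts = Counter(ray.get([which_node.remote() for _ in range(200)]))
print(ip_counts)  # seeing more than one IP here is direct evidence of multi-node execution

# Cross-check against the cluster membership Ray itself reports.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```

The same idea applies to training: log the node IP or hostname from inside each training worker and include that in the logs you show the client.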
One more question: does the training happen in a synchronous or asynchronous fashion in Ray? Any documentation link for the same?
Ray SGD is just a wrapper around PyTorch DistributedDataParallel, which uses synchronous SGD, I believe.
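To illustrate where that synchrony comes from, here is a minimal plain-PyTorch DistributedDataParallel sketch (two CPU workers with the gloo backend, a toy model, and random data, all chosen just for the example); the synchronization happens in `backward()`, where gradients are all-reduced across workers before each optimizer step. As noted above, Ray SGD wraps this kind of setup rather than changing the update semantics.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Rendezvous info for the process group (single-machine demo values).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced here; workers stay in lockstep
        opt.step()        # every worker applies the same averaged gradient

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train, args=(2,), nprocs=2)
```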