How to verify if training is happening in parallel in Ray?

Hi ,

I have a cluster set up. When a parallel iterator is used or training is running on Ray, what are the different ways to verify that it is actually distributed in parallel across nodes?

There are the Ray dashboard and Ray logging. What is the best way to see this happening inside Ray, and to show it as evidence to a client or customer?



The Ray dashboard would be the best way to confirm parallel training. You can also use standard tools such as nvidia-smi or htop to check whether multiple GPUs or CPUs are being utilized.

One more question: does training happen in a synchronous or asynchronous fashion in Ray? Is there a documentation link for this?

Ray SGD is just a wrapper around PyTorch DistributedDataParallel, which uses synchronous SGD, I believe.
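To make "synchronous SGD" concrete, here's a toy NumPy sketch of the scheme DistributedDataParallel implements: each worker computes a gradient on its own data shard, all gradients are averaged (the all-reduce barrier), and only then does every replica apply the same update. The data and model here are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)   # shared model weights, identical on every replica
lr = 0.1

def worker_gradient(w, X, y):
    # Gradient of mean squared error on this worker's shard.
    return 2 * X.T @ (X @ w - y) / len(y)

# Fake dataset split across 4 "workers".
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5])
shards = [(X[i::4], y[i::4]) for i in range(4)]

for step in range(200):
    # In real DDP these gradients are computed in parallel on each node.
    grads = [worker_gradient(w, Xs, ys) for Xs, ys in shards]
    g = np.mean(grads, axis=0)  # synchronization point: all-reduce average
    w -= lr * g                 # every replica applies the same update

print(w)  # converges toward the true weights [1.0, -2.0, 0.5]
```

The key point is the barrier: no worker moves on to the next step until every gradient has been averaged, which is what keeps all replicas identical and is what "synchronous" means here.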