Hi,
I have a cluster set up, and a parallel iterator or training job runs on Ray. What are the different ways to verify that the work is actually distributed in parallel across the nodes?
I know about the Ray dashboard and Ray logging. What is the best way to see this happening inside Ray, and to show it as evidence to a client or customer?
Thanks
Sumanth
The Ray dashboard would be the best way to confirm parallel training. You can also use standard tools on each node to check whether multiple GPUs or CPUs are being utilized, e.g. nvidia-smi for GPUs or htop for CPUs.
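If you want something more concrete to hand to a customer than dashboard screenshots, you can also demonstrate it programmatically. Here is a minimal sketch (assuming it is run from a machine already attached to the cluster; the task count of 200 is just illustrative): launch a batch of small tasks, record which node each one actually executed on, and cross-check against `ray.nodes()`.

```python
import socket
from collections import Counter

import ray

ray.init(address="auto")  # connect to the existing cluster instead of starting a local one

@ray.remote
def which_node():
    # Each task reports the IP of the node it actually executed on.
    return socket.gethostbyname(socket.gethostname())

# Fire off a batch of tasks; with enough of them, Ray spreads them across the cluster.
ip_counts = Counter(ray.get([which_node.remote() for _ in range(200)]))
print(ip_counts)  # seeing more than one IP here is direct evidence of multi-node execution

# Cross-check against the cluster membership Ray itself reports.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```

The same idea applies to training: log the node IP or hostname from inside each training worker and include that in the logs you show the client.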
One more question: does the training happen in a synchronous or asynchronous fashion in Ray? Any documentation link for the same?
Ray SGD is just a wrapper around PyTorch DistributedDataParallel, which uses synchronous SGD, I believe.
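To illustrate where that synchrony comes from, here is a minimal plain-PyTorch DistributedDataParallel sketch (two CPU workers with the gloo backend, a toy model, and random data, all chosen just for the example); the synchronization happens in `backward()`, where gradients are all-reduced across workers before each optimizer step. As noted above, Ray SGD wraps this kind of setup rather than changing the update semantics.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Rendezvous info for the process group (single-machine demo values).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced here; workers stay in lockstep
        opt.step()        # every worker applies the same averaged gradient

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train, args=(2,), nprocs=2)
```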