[RAYSGD] Regular PyTorch program performs better than RaySGD

Hi, I’m new to Ray and RaySGD, and I was following the example with train_linear_example — Ray v1.7.1.
From the callback function I can confirm that the two workers have the same weights, so the parallel training seems to have worked correctly, right?
I then modified the program to train the model with plain PyTorch (no parallelism), using the same number of epochs and the same settings, and got a lower loss than with RaySGD.
I’m wondering why this happens and whether I’m misunderstanding RaySGD.
Thanks for replying.

@JanJF just to make sure, are you using the same seed in both experiments? The same seed would have to be set for both the model initialization as well as for the data loading & shuffling, and in all the workers.
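For reference, a minimal seeding sketch would look something like the following (the helper name and seed value are just illustrative; it would need to run in every worker before the model and the DataLoader are created):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed Python, NumPy, and PyTorch RNGs so that model initialization
    # and data shuffling are reproducible across runs and workers.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # call before building the model and the DataLoader
```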

Thanks for replying!

I have set the seed and run it several times. It seems that the non-parallel program trains on the whole dataset each epoch, while RaySGD with num_workers=2 splits the dataset into two parts. So each worker only trains on half the data, which is why the loss is higher.
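To illustrate the splitting behavior, here is a standalone PyTorch sketch using DistributedSampler (this is not the exact code the Ray trainer runs internally, just an illustration of per-rank sharding):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

# Toy dataset of 1000 samples, purely for illustration.
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))

# With 2 workers, a DistributedSampler gives each rank a disjoint shard,
# so per epoch each worker only iterates over 500 of the 1000 samples.
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=True, seed=42)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    print(f"rank {rank}: {len(sampler)} samples per epoch")  # -> 500
```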

Hey @JanJF, I created an issue on GitHub for this: [Train] Changing `num_workers` affects model loss · Issue #19767 · ray-project/ray · GitHub.

We can move the discussion there. The loss should be the same in both cases: 2 workers each training on half the data should be equivalent to 1 worker training on the full data. In addition to setting the seed, you also have to make sure that the global batch size is the same in both cases. So in the 2-worker case, the per-worker batch size should be half of the 1-worker case.
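As a concrete sketch of the batch size arithmetic (the numbers are illustrative, not from the linear example):

```python
# Keep the global batch size fixed across configurations.
global_batch_size = 128

# 1 worker: the single worker processes the full global batch each step.
single_worker_batch_size = global_batch_size               # 128

# 2 workers: each worker processes half, so one synchronized step still
# averages gradients over global_batch_size samples in total.
num_workers = 2
per_worker_batch_size = global_batch_size // num_workers   # 64
```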