[RAYSGD] Regular PyTorch program performs better than RaySGD

Hi, I’m new to Ray and RaySGD, and I was following the example with train_linear_example — Ray v1.7.1.
From the callback function I can confirm that the two workers have the same weights, so the parallel training seems to have worked correctly, right?
I then modified the program to train the model with plain PyTorch (no parallelism), using the same number of epochs and the same settings, and got a lower loss than with RaySGD.
I’m wondering why this happens and whether I’m misunderstanding RaySGD.
Thanks for replying.

@JanJF just to make sure, are you using the same seed in both experiments? The same seed would have to be set for both the model initialization as well as for the data loading & shuffling, and in all the workers.
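For reference, a minimal seeding sketch would look something like the following (the helper name and seed value are just illustrative; it would need to run in every worker before the model and the DataLoader are created):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed Python, NumPy, and PyTorch RNGs so that model initialization
    # and data shuffling are reproducible across runs and workers.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # call before building the model and the DataLoader
```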

Thanks for replying!

I have set the seed and run it several times. It seems that the non-parallel program trains on the whole dataset each epoch, while RaySGD with num_workers=2 splits the dataset into two parts. So each worker only trains on half the data, which is why the loss is higher.
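To illustrate the splitting behavior, here is a standalone PyTorch sketch using DistributedSampler (this is not the exact code the Ray trainer runs internally, just an illustration of per-rank sharding):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

# Toy dataset of 1000 samples, purely for illustration.
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))

# With 2 workers, a DistributedSampler gives each rank a disjoint shard,
# so per epoch each worker only iterates over 500 of the 1000 samples.
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=True, seed=42)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    print(f"rank {rank}: {len(sampler)} samples per epoch")  # -> 500
```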

Hey @JanJF, I created an issue on GitHub for this: [Train] Changing `num_workers` affects model loss · Issue #19767 · ray-project/ray · GitHub.

We can move the discussion there. The loss should be the same in both cases: 2 workers each training on half the data should be equivalent to 1 worker training on the full data. In addition to setting the seed, you also have to make sure that the global batch size is the same in both cases. So in the 2-worker case, the per-worker batch size should be half of the 1-worker case.
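As a concrete sketch of the batch size arithmetic (the numbers are illustrative, not from the linear example):

```python
# Keep the global batch size fixed across configurations.
global_batch_size = 128

# 1 worker: the single worker processes the full global batch each step.
single_worker_batch_size = global_batch_size               # 128

# 2 workers: each worker processes half, so one synchronized step still
# averages gradients over global_batch_size samples in total.
num_workers = 2
per_worker_batch_size = global_batch_size // num_workers   # 64
```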