For example, BERT will definitely show improvements as you increase the number of workers.
Thanks for this clarification. However, since this batch of 1000 is now distributed across 10 workers (i.e., replicas), shouldn't the time per step either remain the same or decrease? It might decrease if we increase the CPU/GPU per worker, but we have tried that and see no benefit there either.
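So you start with 1 worker and a batch size of 100. Then you move to 10 workers, for a total batch size of 1000, but each worker still operates on a local batch size of 100 while incurring the extra cost of communication overhead. Each worker does the same amount of compute per step as before, so the time per step stays the same at best and usually goes up a little.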
Thus, the right comparison is probably 1 worker with a batch size of 1000 versus 10 workers with a total batch size of 1000 (a local batch size of 100 each).
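To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It is a toy cost model, not a measurement: `step_time`, `SECONDS_PER_EXAMPLE`, and `COMM_OVERHEAD` are hypothetical names with made-up values, and it assumes compute time scales linearly with the local batch size while synchronization adds a flat cost per step.

```python
def step_time(local_batch_size, seconds_per_example, comm_overhead=0.0):
    """Toy model: compute grows linearly with the local batch; sync is a flat add-on."""
    return local_batch_size * seconds_per_example + comm_overhead


# Assumed, illustrative numbers only (not measured on any real cluster).
SECONDS_PER_EXAMPLE = 0.001  # compute cost per example on one worker
COMM_OVERHEAD = 0.02         # per-step gradient synchronization cost

# 1 worker, batch size 100 (the original baseline).
one_worker_small = step_time(100, SECONDS_PER_EXAMPLE)

# 10 workers, local batch size 100 each (global batch size 1000).
ten_workers = step_time(100, SECONDS_PER_EXAMPLE, COMM_OVERHEAD)

# 1 worker, batch size 1000 (the fair comparison for the same global batch).
one_worker_large = step_time(1000, SECONDS_PER_EXAMPLE)

# Examples processed per second in each setup.
throughput_small = 100 / one_worker_small
throughput_ten = 1000 / ten_workers
throughput_large = 1000 / one_worker_large

print(f"1 worker,  batch 100:   {one_worker_small:.3f} s/step, {throughput_small:.0f} examples/s")
print(f"10 workers, 100 each:   {ten_workers:.3f} s/step, {throughput_ten:.0f} examples/s")
print(f"1 worker,  batch 1000:  {one_worker_large:.3f} s/step, {throughput_large:.0f} examples/s")
```

Under these assumptions the 10-worker step is slightly slower than the 1-worker, batch-100 step, but it is much faster than a single worker pushing through all 1000 examples, and the examples processed per second go up. That is the sense in which adding workers helps: throughput, not time per step at a fixed local batch size.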