I am trying to compare an LSTM model trained with Ray SGD distributed TensorFlow against the same model trained without Ray. I have 5 VMs connected to the cluster, so 30 nodes. The dataset size is around 240k. Without Ray, the LSTM takes about 279.84 seconds for 30 epochs, but with Ray SGD and 4 replicas it takes longer, around 350 seconds. Increasing the number of replicas makes it even slower. What is the reason?
For point 1: replicas just create replicated copies of the model, so TFTrainer's performance should depend on the size of the training dataset, and 240k is, I believe, a large dataset?
For point 2: what is the difference between the number of workers and the number of replicas? I have not increased the number of workers, only the number of replicas, so how does that increase the batch size?
Number of workers == number of replicas. When you have 1 replica, it will consume a 100-sized batch. When you have 10 replicas, each replica will consume 100 – thus, in total, a 1000-sized batch is consumed.
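To make the arithmetic explicit, here is a minimal illustrative sketch (the 100-sample per-replica batch and the 4- and 10-replica counts are just the numbers already mentioned in this thread):

```python
# Illustrative arithmetic only: the effective (global) batch size under
# data-parallel training, using the 100-sample per-replica batch from above.
per_replica_batch_size = 100

for num_replicas in (1, 4, 10):
    global_batch_size = per_replica_batch_size * num_replicas
    print(num_replicas, "replica(s) -> global batch size", global_batch_size)
# 1 replica(s) -> global batch size 100
# 4 replica(s) -> global batch size 400
# 10 replica(s) -> global batch size 1000
```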
The model is too small, resulting in heavy communication time in comparison to training time.
(Sabya) - What would you consider a large enough model, where the training-time benefit would outweigh the communication overhead?
The total batch size increases as you increase the number of workers. Thus, the actual training time per step will increase.
(Sabya) - Thanks for this clarification. However, since this 1000-sized batch is now distributed between 10 workers (i.e. replicas), should the time per step not remain the same or decrease? It might reduce if we increase the CPUs/GPUs per worker, but we have tried that and are seeing no benefit there either.
For example, BERT will definitely show improvements as you increase the number of workers.
So you start with 1 worker with a batch size of 100. Then you move to 10 workers, totaling a batch size of 1000, but each worker is still operating on a local batch size of 100 while incurring the extra cost of communication overhead.
Thus, the right comparison is probably 1 worker with a batch size of 1000 versus 10 workers with a total batch size of 1000 (10 workers, local batch size of 100 each).
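To make that concrete, here is a small sketch using the 240k dataset size from the original question and the illustrative 100 / 1000 batch sizes above; note how the fair baseline matches the distributed run in steps per epoch:

```python
# Sketch: why 10 workers x local batch 100 is a different experiment from
# 1 worker x batch 100 - the global batch, and hence steps per epoch, changes.
dataset_size = 240_000  # from the original question

def steps_per_epoch(global_batch_size):
    return dataset_size // global_batch_size

print(steps_per_epoch(100))       # 2400 steps/epoch: the original 1-worker run
print(steps_per_epoch(10 * 100))  # 240 steps/epoch: 10 workers, local batch 100,
                                  # but each step now pays the communication cost
print(steps_per_epoch(1000))      # 240 steps/epoch: the fair 1-worker baseline
```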
So is the batch_size we define in the TFTrainer config the batch size for an individual worker, or the total batch size divided by the number of replicas? For example, if we give 10 replicas and a batch_size of 1000, does each of the 10 replica workers get a 1000-sized batch (10 * 1000 in total), or does each replica get a 100-sized batch (1000 / 10)?
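For reference, here is a rough sketch of how this is typically wired up. It assumes the legacy ray.util.sgd.tf TFTrainer API (the exact import path and argument names vary by Ray version), and the creator functions and dummy data below are placeholders rather than anything from this thread. In this pattern each replica's data_creator batches with config["batch_size"], which is why, per the earlier replies, the effective global batch works out to batch_size * num_replicas:

```python
# Hedged sketch only: assumes the legacy Ray SGD TFTrainer API; the model,
# dummy data, and shapes are placeholders, not taken from this thread.
import numpy as np
import tensorflow as tf

import ray
from ray.util.sgd.tf import TFTrainer


def model_creator(config):
    # Build and compile the model on each replica; the LSTM from the
    # original question would go here instead of this toy network.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(50, 8)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


def data_creator(config):
    # Each replica calls this and batches with config["batch_size"], so the
    # configured value acts as the per-replica batch size; the effective
    # global batch is batch_size * num_replicas.
    x = np.random.rand(1000, 50, 8).astype("float32")
    y = np.random.rand(1000, 1).astype("float32")
    train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(
        config["batch_size"])
    return train_ds, train_ds  # (train, validation) pair


ray.init()
trainer = TFTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    num_replicas=10,             # number of workers == number of replicas
    config={"batch_size": 100},  # consumed per replica -> global batch of 1000
)
stats = trainer.train()  # one round of distributed training
trainer.shutdown()
```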
Another question: do epochs need to be distributed as well? Without TFTrainer SGD we used 100 epochs; if 10 replicas are used, do we need to reduce the number of epochs, or should we still give 100 epochs since they will run in parallel?