I am trying to compare an LSTM model trained with Ray SGD distributed TensorFlow against the same model trained without Ray. I have 5 VMs connected to the cluster, so 30 nodes. The dataset size is around 240k. Without Ray, the LSTM takes about 279.84 seconds for 30 epochs, but with Ray SGD and 4 replicas it takes longer, around 350 seconds. Increasing the number of replicas makes it even slower. What is the reason?
For point 1: replicas just create replicated copies of the model, so TFTrainer's performance should depend on the size of the training dataset, and 240k is, I believe, a large dataset?
For point 2: what is the difference between the number of workers and the number of replicas? I have not increased the number of workers, only the number of replicas, so how does that increase the batch size?
Number of workers == number of replicas. When you have 1 replica, it will consume a 100-sized batch. When you have 10 replicas, each replica will consume 100 – thus, in total, a 1000-sized batch is consumed.
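To make the arithmetic explicit, here is a minimal illustrative sketch (the 100-sample per-replica batch and the 4- and 10-replica counts are just the numbers already mentioned in this thread):

```python
# Illustrative arithmetic only: the effective (global) batch size under
# data-parallel training, using the 100-sample per-replica batch from above.
per_replica_batch_size = 100

for num_replicas in (1, 4, 10):
    global_batch_size = per_replica_batch_size * num_replicas
    print(num_replicas, "replica(s) -> global batch size", global_batch_size)
# 1 replica(s) -> global batch size 100
# 4 replica(s) -> global batch size 400
# 10 replica(s) -> global batch size 1000
```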
The model is too small, resulting in heavy communication time in comparison to training time.
(Sabya) - What would you consider a large enough model, where the training-time benefit would outweigh the communication overhead?
The total batch size increases as you increase the number of workers. Thus, the actual training time per step will increase.
(Sabya) - Thanks for this clarification. However, since this 1000-sized batch is now distributed between 10 workers (i.e. replicas), should the time per step not remain the same or decrease? It might reduce if we increase the CPUs/GPUs per worker, but we have tried that and are seeing no benefit there either.
For example, BERT will definitely show improvements as you increase the number of workers.
So you start with 1 worker with a batch size of 100. Then you move to 10 workers, totaling a batch size of 1000, but each worker is still operating on a local batch size of 100 while incurring the extra cost of communication overhead.
Thus, the right comparison is probably 1 worker with a batch size of 1000 versus 10 workers with a total batch size of 1000 (10 workers, local batch size of 100 each).
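To make that concrete, here is a small sketch using the 240k dataset size from the original question and the illustrative 100 / 1000 batch sizes above; note how the fair baseline matches the distributed run in steps per epoch:

```python
# Sketch: why 10 workers x local batch 100 is a different experiment from
# 1 worker x batch 100 - the global batch, and hence steps per epoch, changes.
dataset_size = 240_000  # from the original question

def steps_per_epoch(global_batch_size):
    return dataset_size // global_batch_size

print(steps_per_epoch(100))       # 2400 steps/epoch: the original 1-worker run
print(steps_per_epoch(10 * 100))  # 240 steps/epoch: 10 workers, local batch 100,
                                  # but each step now pays the communication cost
print(steps_per_epoch(1000))      # 240 steps/epoch: the fair 1-worker baseline
```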
So is the batch_size we define in the TFTrainer config the batch size for an individual worker, or the total batch size divided by the number of replicas? For example, if we give 10 replicas and a batch_size of 1000, does each of the 10 replica workers get a 1000-sized batch (10 * 1000 in total), or does each replica get a 100-sized batch (1000 / 10)?
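For reference, here is a rough sketch of how this is typically wired up. It assumes the legacy ray.util.sgd.tf TFTrainer API (the exact import path and argument names vary by Ray version), and the creator functions and dummy data below are placeholders rather than anything from this thread. In this pattern each replica's data_creator batches with config["batch_size"], which is why, per the earlier replies, the effective global batch works out to batch_size * num_replicas:

```python
# Hedged sketch only: assumes the legacy Ray SGD TFTrainer API; the model,
# dummy data, and shapes are placeholders, not taken from this thread.
import numpy as np
import tensorflow as tf

import ray
from ray.util.sgd.tf import TFTrainer


def model_creator(config):
    # Build and compile the model on each replica; the LSTM from the
    # original question would go here instead of this toy network.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(50, 8)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


def data_creator(config):
    # Each replica calls this and batches with config["batch_size"], so the
    # configured value acts as the per-replica batch size; the effective
    # global batch is batch_size * num_replicas.
    x = np.random.rand(1000, 50, 8).astype("float32")
    y = np.random.rand(1000, 1).astype("float32")
    train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(
        config["batch_size"])
    return train_ds, train_ds  # (train, validation) pair


ray.init()
trainer = TFTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    num_replicas=10,             # number of workers == number of replicas
    config={"batch_size": 100},  # consumed per replica -> global batch of 1000
)
stats = trainer.train()  # one round of distributed training
trainer.shutdown()
```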
Another question: do epochs need to be distributed as well? Without TFTrainer SGD we used 100 epochs; if 10 replicas are used, do we need to reduce the number of epochs, or should we still give 100 epochs since they will run in parallel?