Severity: High. It blocks me from completing my task.
I am trying to train a TensorFlow model with Ray Train. My input data has 24 batches (1536 samples with batch_size = 64) and I am running 2 epochs. Refer to the code below.
(Here I am using 2 CPUs for Ray distributed training.)
```python
import os
import tensorflow as tf

# Ray Train (legacy Trainer API) imports
from ray import train
from ray.train import Trainer, CheckpointStrategy
from ray.train.callbacks import PrintingCallback


def train_func(config):
    CONF, data_path = config.values()
    X_train, y_train = read_input_data(data_path)  # helper defined elsewhere

    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = lstm_model()  # returns a Sequential model

    # resume from the checkpoint passed to trainer.run(), if any
    start_epoch = 0
    checkpoint = train.load_checkpoint()
    if checkpoint:
        model.set_weights(checkpoint.get("model_weights"))
        start_epoch = checkpoint.get("epoch", -1) + 1

    model.compile()

    epochs = 2
    for epoch in range(start_epoch, epochs):
        history = model.fit(X_train, y_train, batch_size=64)
        train.save_checkpoint(epoch=epoch, accuracy=history.history['acc'][0],
                              model_weights=model.get_weights())


config = CONF  # model parameters, defined elsewhere
trainer = Trainer(backend="tensorflow", num_workers=num_cpus, logdir=os.getcwd())  # num_cpus = 2 here
trainer.start()
checkpoint_strategy = CheckpointStrategy(num_to_keep=1,
                                         checkpoint_score_attribute="accuracy",
                                         checkpoint_score_order="max")
trainer.run(train_func, config=config, callbacks=[PrintingCallback()],
            checkpoint_strategy=checkpoint_strategy,
            checkpoint=trainer.latest_checkpoint)
trainer.shutdown()
```
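For context on the numbers: 1536 samples with batch_size = 64 gives 24 batches per epoch, which is where the 24 above comes from. To check how many of those batches each worker actually runs, a small Keras callback like the sketch below could be passed to model.fit() inside train_func. This is only an illustration: the `BatchCounter` class and the per-worker tagging via `train.world_rank()` are my additions for debugging, not part of the actual script, and `train.world_rank()` only works inside a Ray Train worker.

```python
import tensorflow as tf
from ray import train


class BatchCounter(tf.keras.callbacks.Callback):
    """Counts how many batches this worker actually processes per epoch."""

    def on_epoch_begin(self, epoch, logs=None):
        self.batches = 0

    def on_train_batch_end(self, batch, logs=None):
        self.batches += 1

    def on_epoch_end(self, epoch, logs=None):
        # tag the print with the worker rank so interleaved worker logs stay readable
        print(f"worker {train.world_rank()} - epoch {epoch}: {self.batches} batches")


# inside train_func:
# history = model.fit(X_train, y_train, batch_size=64, callbacks=[BatchCounter()])
```

If no batches were being skipped, I would expect each worker to report 24 batches per epoch.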
From the logs you can see that the model is being trained on only a few of the 24 batches. Also, each time the model is trained on a random set of batches and on a different number of batches. Please refer to the screenshot of the logs below.
Logs of the 1st and 2nd epochs:
In the screenshot above you can see that in the 1st epoch the model is trained only on the 5th, 12th, 16th, and 24th batches, and in the 2nd epoch on batches [1, 9, 16, 22, 24].
Can you please help me understand whether my assumption is correct that some batches are being skipped during training? If yes, why is this happening?
Also, please let me know if I am missing something.