Issue in iterative training of Tensorflow Model with Ray

High: It blocks me from completing my task.

I want to train a TensorFlow LSTM model iteratively on a number of datasets, one after another.

# Imports assumed for this snippet (Ray 2.x AIR-style API; adjust to your Ray version)
import numpy as np
import ray
from ray.air import session, Checkpoint
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# Rank saved checkpoints by the reported "accuracy" metric
checkpoint_config = CheckpointConfig(checkpoint_score_attribute="accuracy", checkpoint_score_order="max")

batch_list = [0,1,2]

print(f"Starting Master batch_{batch_list[0]}")
X_train = np.random.choice([0, 1], size=(512,10,1016))
y_train = np.random.choice([0, 1], size=512)

dataset = ray.data.from_items([{"x": X_train[index,:,:], "y": y_train[index]} for index in range(y_train.shape[0])])
del X_train, y_train
print("This cluster consists of {} nodes with {} CPU resources in total".format(len(ray.nodes()), ray.cluster_resources()['CPU']))

# run_epoch is my training function and config is its train_loop_config; both are defined in my notebook
trainer = TensorflowTrainer(run_epoch, train_loop_config=config,
                            scaling_config=ScalingConfig(num_workers=2, trainer_resources={"CPU": 0}),
                            run_config=RunConfig(checkpoint_config=checkpoint_config),
                            datasets={"train": dataset})
print("Initialized trainer object ....")
result = trainer.fit()
del dataset
print(f"Completed Master batch_{batch_list[0]}")

print("****result_0:   ", result)

for batch in batch_list[1:]:
  print(f"Starting Master batch_{batch}")

  X_train = np.random.choice([0, 1], size=(512,10,1016))
  y_train = np.random.choice([0, 1], size=512)
  dataset = ray.data.from_items([{"x": X_train[index,:,:], "y": y_train[index]} for index in range(y_train.shape[0])])
  del X_train, y_train

  # Resume from the checkpoint produced by the previous iteration
  trainer = TensorflowTrainer(run_epoch, train_loop_config=config,
                              scaling_config=ScalingConfig(num_workers=2, trainer_resources={"CPU": 0}),
                              run_config=RunConfig(checkpoint_config=checkpoint_config),
                              resume_from_checkpoint=result.checkpoint,
                              datasets={"train": dataset})
  result = trainer.fit()

To do this, I pass the checkpoint from the previous training iteration to the next one via the resume_from_checkpoint parameter of the TensorflowTrainer constructor. I referred to the following documentation:
https://docs.ray.io/en/latest/train/dl_guide.html#loading-checkpoints
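
For context, that guide loads the checkpoint inside the training function with session.get_checkpoint(). Below is a minimal sketch of that loading side, assuming the Ray 2.x AIR API; build_model is a hypothetical placeholder, not my actual LSTM code:

# Sketch only: how a training function would restore state from the checkpoint
# passed via resume_from_checkpoint. build_model is a hypothetical helper.
from ray.air import session

def run_epoch(config):
    model = build_model(config)          # placeholder for the real Keras/LSTM model setup
    ckpt = session.get_checkpoint()      # checkpoint handed in via resume_from_checkpoint
    if ckpt is not None:
        ckpt_dict = ckpt.to_dict()       # same dict that Checkpoint.from_dict(...) packed
        model.set_weights(ckpt_dict["model_weights"])
    # ... then run the usual fit / session.report loop ...

In my actual code the checkpoint is built with Checkpoint.from_dict(...) as shown further down, so to_dict() should give back the same keys.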

But in the second iteration I get an error saying the result dict has no key accuracy:

ERROR checkpoint_manager.py:328 -- Result dict has no key: accuracy. checkpoint_score_attr must be set to a key in the result dict. Valid keys are: ['trial_id', 'experiment_id', 'date', 'timestamp', 'pid', 'hostname', 'node_ip', 'done']
Trial TensorflowTrainer_256fc_00000 completed. Last result:

even though the accuracy key is present in the result dict; I report it, and also store it in the checkpoint, in my training function run_epoch:

report_dict = {
    "epoch": [epoch + 1],
    "loss": history.history['loss'][0],
    "accuracy": history.history['accuracy'][0]
}
ckpt = Checkpoint.from_dict(dict(epoch=epoch, accuracy=history.history['accuracy'][0], model_weights=model.get_weights()))
session.report(report_dict, checkpoint=ckpt)

You can refer to my notebook for a better view and to reproduce the error.

Please help me understand and resolve this issue, and let me know if I am missing something.

Hi @suraj-gade,

Is it possible to provide your training function? I tried to reproduce the error with a dummy run_epoch function, but was not able to get the error that you are seeing.
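
For reference, the dummy I tried was roughly of the following shape (simplified sketch, assuming the Ray 2.x AIR API and a tiny stand-in Keras model rather than your LSTM):

# Simplified dummy training function used for the reproduction attempt:
# reports "accuracy" plus a dict checkpoint each epoch, and restores weights
# from a checkpoint if one was passed via resume_from_checkpoint.
import numpy as np
import tensorflow as tf
from ray.air import session, Checkpoint

def run_epoch(config):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    ckpt = session.get_checkpoint()
    if ckpt is not None:
        model.set_weights(ckpt.to_dict()["model_weights"])

    x = np.random.random((64, 4)).astype("float32")
    y = np.random.randint(0, 2, size=(64, 1))

    for epoch in range(config.get("num_epochs", 1)):
        history = model.fit(x, y, epochs=1, verbose=0)
        session.report(
            {
                "epoch": epoch + 1,
                "loss": history.history["loss"][0],
                "accuracy": history.history["accuracy"][0],
            },
            checkpoint=Checkpoint.from_dict(
                {
                    "epoch": epoch,
                    "accuracy": history.history["accuracy"][0],
                    "model_weights": model.get_weights(),
                }
            ),
        )

With a function like this, the reported metrics always include the accuracy key, so the CheckpointConfig scoring works across iterations; that is why seeing your actual run_epoch would help pin down what differs.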