High: It blocks me from completing my task.
I want to train a TensorFlow LSTM model iteratively on a number of datasets, one after another.
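import numpy as np
import ray
from ray.air import Checkpoint, session
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# run_epoch (the training function) and config are defined elsewhere.
# Keep the checkpoint with the highest reported accuracy.
checkpoint_config = CheckpointConfig(checkpoint_score_attribute="accuracy", checkpoint_score_order="max")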
batch_list = [0,1,2]
print(f"Starting Master batch_{batch_list[0]}")
X_train = np.random.choice([0, 1], size=(512,10,1016))
y_train = np.random.choice([0, 1], size=512)
dataset = ray.data.from_items([{"x": X_train[index,:,:], "y": y_train[index]} for index in range(y_train.shape[0])])
del X_train, y_train
print(f"This cluster consists of {len(ray.nodes())} nodes and {ray.cluster_resources()['CPU']} CPU resources in total")
trainer = TensorflowTrainer(
    run_epoch,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, trainer_resources={"CPU": 0}),
    run_config=RunConfig(checkpoint_config=checkpoint_config),
    datasets={"train": dataset},
)
print("Initialized trainer object ....")
result = trainer.fit()
del dataset
print(f"Completed Master batch_{batch_list[0]}")
print("****result_0: ", result)
for batch in batch_list[1:]:
    print(f"Starting Master batch_{batch}")
    X_train = np.random.choice([0, 1], size=(512,10,1016))
    y_train = np.random.choice([0, 1], size=512)
    dataset = ray.data.from_items([{"x": X_train[index,:,:], "y": y_train[index]} for index in range(y_train.shape[0])])
    del X_train, y_train
    # Resume from the checkpoint produced by the previous master batch.
    trainer = TensorflowTrainer(
        run_epoch,
        train_loop_config=config,
        scaling_config=ScalingConfig(num_workers=2, trainer_resources={"CPU": 0}),
        run_config=RunConfig(checkpoint_config=checkpoint_config),
        resume_from_checkpoint=result.checkpoint,
        datasets={"train": dataset},
    )
    result = trainer.fit()
To do this, I pass the checkpoint from the previous training iteration to the next one using the resume_from_checkpoint
parameter of TensorflowTrainer. I referred to the following documentation:
https://docs.ray.io/en/latest/train/dl_guide.html#loading-checkpoints
But in the 2nd iteration I get an error saying that the result dict has no key accuracy:
ERROR checkpoint_manager.py:328 -- Result dict has no key: accuracy. checkpoint_score_attr must be set to a key in the result dict. Valid keys are: ['trial_id', 'experiment_id', 'date', 'timestamp', 'pid', 'hostname', 'node_ip', 'done']
Trial TensorflowTrainer_256fc_00000 completed. Last result:
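My reading of the message is that the valid keys listed are only the automatically added ones, and none of the metrics I report (epoch, loss, accuracy) appear among them. To make that concrete, here is a minimal sketch of the kind of lookup the message describes. It is a simplified stand-in with hypothetical names (find_score, AUTO_FILLED_KEYS), not Ray's actual checkpoint_manager code.

```python
# Illustrative only: a simplified stand-in for the lookup described in the log.
# This is NOT Ray's implementation; find_score and AUTO_FILLED_KEYS are
# hypothetical names used for this sketch.
AUTO_FILLED_KEYS = ["trial_id", "experiment_id", "date", "timestamp",
                    "pid", "hostname", "node_ip", "done"]


def find_score(result, score_attribute):
    """Return the score used to rank a checkpoint, or None if the key is missing."""
    if score_attribute not in result:
        print(f"Result dict has no key: {score_attribute}. "
              f"Valid keys are: {list(result)}")
        return None
    return result[score_attribute]


# A result that carries the metrics reported from the training function.
reported = {"epoch": [1], "loss": 0.69, "accuracy": 0.51, "done": False}

# A result that only has the automatically added keys, as in the log above.
bare = {key: None for key in AUTO_FILLED_KEYS}

score_reported = find_score(reported, "accuracy")  # key present
score_bare = find_score(bare, "accuracy")          # key missing, prints a message
```

With the first dict the accuracy value is found, with the second the lookup fails in the same way as in my log, which is why I suspect that one result in the 2nd iteration reaches this check without my reported metrics.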
The error occurs even though the accuracy key is in the result dict that I report, and also in the checkpoint that I create, in my training function run_epoch:
report_dict = {
    "epoch": [epoch+1],
    "loss": history.history['loss'][0],
    "accuracy": history.history['accuracy'][0]
}
ckpt = Checkpoint.from_dict(
    dict(
        epoch=epoch,
        accuracy=history.history['accuracy'][0],
        model_weights=model.get_weights(),
    )
)
session.report(report_dict, checkpoint=ckpt)
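For completeness, this is roughly how the next iteration is meant to pick up where the previous one stopped. It is a minimal sketch on a plain dict with the same keys as the checkpoint above. The helper name restore_state is hypothetical, and I am assuming that in the real run_epoch the dict would come from session.get_checkpoint().to_dict().

```python
from typing import Any, Dict, List, Optional, Tuple


def restore_state(ckpt_dict: Optional[Dict[str, Any]]) -> Tuple[int, Optional[List[Any]]]:
    """Return (start_epoch, model_weights) for the next training iteration.

    Hypothetical helper for illustration. In the real run_epoch the dict is
    assumed to come from session.get_checkpoint().to_dict().
    """
    if ckpt_dict is None:
        # First master batch: there is nothing to resume from.
        return 0, None
    # Same keys as in Checkpoint.from_dict(...) above.
    return ckpt_dict["epoch"] + 1, ckpt_dict["model_weights"]


# Stand-in for the stored checkpoint dict. The real model_weights would be
# the list of arrays returned by model.get_weights().
previous = {"epoch": 0, "accuracy": 0.5, "model_weights": [[0.1, 0.2], [0.3]]}

first_start, first_weights = restore_state(None)
start_epoch, weights = restore_state(previous)
```

In the training function, the restored weights would then be applied with model.set_weights(weights) before calling model.fit on the new dataset.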
You can refer to my notebook for a better view and to reproduce the error.
Please help me understand and resolve the issue, and let me know if I am missing something.