Issue in iterative training of Tensorflow Model with Ray

High: It blocks me from completing my task.

I want to train a TensorFlow LSTM model iteratively on a number of datasets, one after another.

# Imports assumed for this snippet (Ray 2.x AIR-style API; adjust to your Ray version)
import numpy as np
import ray
from ray.air import session, Checkpoint
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# Rank saved checkpoints by the reported "accuracy" metric
checkpoint_config = CheckpointConfig(checkpoint_score_attribute="accuracy", checkpoint_score_order="max")

batch_list = [0,1,2]

print(f"Starting Master batch_{batch_list[0]}")
X_train = np.random.choice([0, 1], size=(512,10,1016))
y_train = np.random.choice([0, 1], size=512)

dataset = ray.data.from_items([{"x": X_train[index,:,:], "y": y_train[index]} for index in range(y_train.shape[0])])
del X_train, y_train
print("This cluster consists of {} nodes with {} CPU resources in total".format(len(ray.nodes()), ray.cluster_resources()['CPU']))

# run_epoch is my training function and config is its train_loop_config; both are defined in my notebook
trainer = TensorflowTrainer(run_epoch, train_loop_config=config,
                            scaling_config=ScalingConfig(num_workers=2, trainer_resources={"CPU": 0}),
                            run_config=RunConfig(checkpoint_config=checkpoint_config),
                            datasets={"train": dataset})
print("Initialized trainer object ....")
result = trainer.fit()
del dataset
print(f"Completed Master batch_{batch_list[0]}")

print("****result_0:   ", result)

for batch in batch_list[1:]:
  print(f"Starting Master batch_{batch}")

  X_train = np.random.choice([0, 1], size=(512,10,1016))
  y_train = np.random.choice([0, 1], size=512)
  dataset = ray.data.from_items([{"x": X_train[index,:,:], "y": y_train[index]} for index in range(y_train.shape[0])])
  del X_train, y_train

  # Resume from the checkpoint produced by the previous iteration
  trainer = TensorflowTrainer(run_epoch, train_loop_config=config,
                              scaling_config=ScalingConfig(num_workers=2, trainer_resources={"CPU": 0}),
                              run_config=RunConfig(checkpoint_config=checkpoint_config),
                              resume_from_checkpoint=result.checkpoint,
                              datasets={"train": dataset})
  result = trainer.fit()

To do this, I pass the checkpoint from the previous training iteration to the next one via the resume_from_checkpoint parameter of the TensorflowTrainer constructor. I referred to the following documentation:
https://docs.ray.io/en/latest/train/dl_guide.html#loading-checkpoints
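
For context, that guide loads the checkpoint inside the training function with session.get_checkpoint(). Below is a minimal sketch of that loading side, assuming the Ray 2.x AIR API; build_model is a hypothetical placeholder, not my actual LSTM code:

# Sketch only: how a training function would restore state from the checkpoint
# passed via resume_from_checkpoint. build_model is a hypothetical helper.
from ray.air import session

def run_epoch(config):
    model = build_model(config)          # placeholder for the real Keras/LSTM model setup
    ckpt = session.get_checkpoint()      # checkpoint handed in via resume_from_checkpoint
    if ckpt is not None:
        ckpt_dict = ckpt.to_dict()       # same dict that Checkpoint.from_dict(...) packed
        model.set_weights(ckpt_dict["model_weights"])
    # ... then run the usual fit / session.report loop ...

In my actual code the checkpoint is built with Checkpoint.from_dict(...) as shown further down, so to_dict() should give back the same keys.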

But in the second iteration I get an error saying the result dict has no key accuracy:

ERROR checkpoint_manager.py:328 -- Result dict has no key: accuracy. checkpoint_score_attr must be set to a key in the result dict. Valid keys are: ['trial_id', 'experiment_id', 'date', 'timestamp', 'pid', 'hostname', 'node_ip', 'done']
Trial TensorflowTrainer_256fc_00000 completed. Last result:

even though the accuracy key is present in the result dict; I report it, and also store it in the checkpoint, in my training function run_epoch:

report_dict = {
    "epoch": [epoch + 1],
    "loss": history.history['loss'][0],
    "accuracy": history.history['accuracy'][0]
}
ckpt = Checkpoint.from_dict(dict(epoch=epoch, accuracy=history.history['accuracy'][0], model_weights=model.get_weights()))
session.report(report_dict, checkpoint=ckpt)

You can refer to my notebook for a better view and to reproduce the error.

Please help me understand and resolve this issue, and let me know if I am missing something.

Hi @suraj-gade,

Is it possible to provide your training function? I tried to reproduce the error with a dummy run_epoch function, but was not able to get the error that you are seeing.
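
For reference, the dummy I tried was roughly of the following shape (simplified sketch, assuming the Ray 2.x AIR API and a tiny stand-in Keras model rather than your LSTM):

# Simplified dummy training function used for the reproduction attempt:
# reports "accuracy" plus a dict checkpoint each epoch, and restores weights
# from a checkpoint if one was passed via resume_from_checkpoint.
import numpy as np
import tensorflow as tf
from ray.air import session, Checkpoint

def run_epoch(config):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    ckpt = session.get_checkpoint()
    if ckpt is not None:
        model.set_weights(ckpt.to_dict()["model_weights"])

    x = np.random.random((64, 4)).astype("float32")
    y = np.random.randint(0, 2, size=(64, 1))

    for epoch in range(config.get("num_epochs", 1)):
        history = model.fit(x, y, epochs=1, verbose=0)
        session.report(
            {
                "epoch": epoch + 1,
                "loss": history.history["loss"][0],
                "accuracy": history.history["accuracy"][0],
            },
            checkpoint=Checkpoint.from_dict(
                {
                    "epoch": epoch,
                    "accuracy": history.history["accuracy"][0],
                    "model_weights": model.get_weights(),
                }
            ),
        )

With a function like this, the reported metrics always include the accuracy key, so the CheckpointConfig scoring works across iterations; that is why seeing your actual run_epoch would help pin down what differs.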