Can't save checkpoint when using TensorFlow and PBT

Hello everyone, I was using Ray Tune's PBT to tune my model, but I can't find the saved model in the checkpoint directory. At the end of training I can get the best config, but not the best model, and I get an error like this:

```
(pid=27000) 2021-01-11 17:16:43.197133: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:109
: Not found: Failed to create a NewWriteableFile: D:\probe\pbt_checkpoint\pbt_test\MLPmodel_39793_00000_0_af_0=0,af_1=2,af_2=1,af_3=0,af_4=2,af_5=2,af_6=2,af_7=1,af_output=3,batchsize=849,num_layers=5,units_0=498,_2021-01-11_17-15-41\variables\variables_temp_62b4641374534df4bd63c5ecfd5991b3/part-00000-of-00001.data-00000-of-00001.tempstate15293471527192116649 : ϵͳ�Ҳ���ָ����·����
(pid=27000) ; No such process
2021-01-11 17:16:43,978 ERROR worker.py:980 -- Possible unhandled error from worker: ray::MLPmodel.save_to_object() (pid=27000, ip=172.16.1.32)
  File "python\ray\_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "D:\anaconda3\envs\BA_37\lib\site-packages\ray\function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\ray\tune\trainable.py", line 295, in save_to_object
    checkpoint_path = self.save(tmpdir)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\ray\tune\trainable.py", line 278, in save
    checkpoint = self.save_checkpoint(checkpoint_dir)
  File "d:/Probe/PBT/PBT_probe.py", line 94, in save_checkpoint
    self.model.save(file_path)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1979, in save
    signatures, options)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\keras\saving\save.py", line 134, in save_model
    signatures, options)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\keras\saving\saved_model\save.py", line 80, in save
    save_lib.save(model, filepath, signatures, options)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\saved_model\save.py", line 985, in save
    options=ckpt_options)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\training\tracking\util.py", line 1200, in save
    file_prefix_tensor, object_graph_tensor, options)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\training\tracking\util.py", line 1145, in _save_cached_when_graph_building
    save_op = saver.save(file_prefix, options=options)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\training\saving\functional_saver.py", line 295, in save
    return save_fn()
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\training\saving\functional_saver.py", line 269, in save_fn
    sharded_saves.append(saver.save(shard_prefix, options))
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\training\saving\functional_saver.py", line 78, in save
    return io_ops.save_v2(file_prefix, tensor_names, tensor_slices, tensors)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1731, in save_v2
    ctx=_ctx)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1751, in save_v2_eager_fallback
    ctx=ctx, name=name)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: Failed to create a directory: D:\probe\pbt_checkpoint\pbt_test\MLPmodel_39793_00000_0_af_0=0,af_1=2,af_2=1,af_3=0,af_4=2,af_5=2,af_6=2,af_7=1,af_output=3,batchsize=849,num_layers=5,units_0=498,_2021-01-11_17-15-41\tmpulblmjabsave_to_object\checkpoint_2/model\variables\variables_temp_c123b60a74554ae3b4da1f882fa4089b; No such file or directory [Op:SaveV2]
```
```
2021-01-11 17:16:49,136 ERROR worker.py:980 -- Possible unhandled error from worker: ray::MLPmodel.stop() (pid=27000, ip=172.16.1.32)
  File "python\ray\_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "D:\anaconda3\envs\BA_37\lib\site-packages\ray\function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "D:\anaconda3\envs\BA_37\lib\site-packages\ray\tune\trainable.py", line 512, in stop
    self.cleanup()
  File "d:/Probe/PBT/PBT_probe.py", line 102, in cleanup
    saved_path = self.model.save(self.logdir)
  [... same TensorFlow save stack as in the traceback above ...]
  File "D:\anaconda3\envs\BA_37\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 355: invalid continuation byte
```

This is the code I use to save and load checkpoints:

```python
def save_checkpoint(self, checkpoint_dir):
    # use os.path.join so the path separator is correct on Windows
    file_path = os.path.join(checkpoint_dir, "model")
    self.model.save(file_path)
    return file_path

def load_checkpoint(self, path):
    del self.model
    self.model = load_model(path)
```

Here is my tune.run call:

```python
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,
    hyperparam_mutations=mutationspace)

results = tune.run(
    MLPmodel,
    name="pbt_test",
    local_dir=os.path.normpath('D:/probe/pbt_checkpoint/'),
    scheduler=pbt,
    metric="msle",
    mode="min",
    reuse_actors=True,
    resources_per_trial={
        "cpu": 3,
        "gpu": 1
    },
    stop={"training_iteration": 4},
    num_samples=2,
    config=searchspace,
)
```

Any suggestions why this might happen and how to fix it?

Thank you

You can add, for example, these lines:

```python
results = tune.run(
    ...,  # your existing arguments
    keep_checkpoints_num=3,
    checkpoint_freq=3,
    checkpoint_at_end=True,
)
```

`keep_checkpoints_num` - keep only the last 3 checkpoints (older ones are deleted automatically)
`checkpoint_freq` - save a checkpoint every 3 iterations
`checkpoint_at_end` - also save a checkpoint when the trial finishes

Then look in the 'ray_results' directory for subdirectories named 'checkpoint_x', where x is the iteration number.
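If it helps, the checkpoint directories can also be collected programmatically; here is a small sketch (`list_checkpoints` is a name I made up, and it assumes Tune's default `checkpoint_<x>` directory layout):

```python
import glob
import os

def list_checkpoints(trial_dir):
    """List checkpoint_<x> directories for one trial, sorted by iteration.

    Assumes Tune's default naming: <trial_dir>/checkpoint_<iteration>.
    """
    dirs = glob.glob(os.path.join(trial_dir, "checkpoint_*"))
    # sort numerically, not lexically, so checkpoint_10 comes after checkpoint_2
    return sorted(dirs, key=lambda d: int(d.rsplit("_", 1)[-1]))
```

The numeric sort matters because a plain string sort would put `checkpoint_10` before `checkpoint_2`.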

Peter

Hi Peter,
Thanks for your advice! I followed it, but I still can't see the model under 'checkpoint_x/model', and I got the following error log:

```
ray.exceptions.RayTaskError(NotFoundError): ray::MLPmodel.save() (pid=26136, ip=172.16.1.32)
tensorflow.python.framework.errors_impl.NotFoundError: Failed to create a directory: D:\probe\pbt_checkpoint\pbt_test\MLPmodel_9a61d_00001_1_af_0=1,af_1=2,af_2=2,af_3=2,af_4=0,af_5=1,af_6=2,af_7=1,af_output=3,batchsize=1006,num_layers=4,units_0=62,_2021-01-11_20-10-53\checkpoint_3/model\variables\variables_temp_fc9757f57fd84106aebf0d0346f99f22; No such file or directory [Op:SaveV2]
```
Maybe the checkpoint path is too long for TensorFlow to save on Windows (the trial directory name contains the whole hyperparameter configuration), so the model can't be saved. Do you have any idea?
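One thing I am trying along these lines: `tune.run` accepts a `trial_dirname_creator` callable that replaces the long auto-generated trial directory name. A minimal sketch (`short_dirname` is my own helper name, not part of Ray):

```python
def short_dirname(trial):
    """Return a short trial directory name to stay under the Windows
    MAX_PATH limit (260 characters).

    Tune calls this with a Trial object; trainable_name and trial_id
    are attributes of that object.
    """
    # e.g. "MLPmodel_39793_00000" instead of the full hyperparameter string
    return "{}_{}".format(trial.trainable_name, trial.trial_id)

# then pass it to tune.run(..., trial_dirname_creator=short_dirname)
```

With this the checkpoint path should be far shorter than the config-encoded names in the errors above.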

This is the checkpoint directory. Here is simple code if you want to use the PPO trainer on a typical gym environment (it's not perfect, but it works):

```python
analysis = tune.run(
    run_or_experiment="PPO",  # check whether your environment is continuous or discrete before choosing the training algorithm
    scheduler=asha_scheduler,
    keep_checkpoints_num=3,
    checkpoint_freq=3,
    checkpoint_at_end=True,
    stop={"episode_reward_mean": 300},  # stop training once episode_reward_mean reaches this value
    mode="max",  # maximize the target metric
    reuse_actors=True,
    config=config,
    verbose=3,  # 0 = silent, 1 = only status updates, 2 = status and brief trial results, 3 = status and detailed trial results (default)
)

checkpoints = analysis.get_trial_checkpoints_paths(
    trial=analysis.get_best_trial("episode_reward_mean"),
    metric="episode_reward_mean")

print("checkpoints=", checkpoints)
checkpoint_path, reward = checkpoints[0]
print("checkpoint_path=", checkpoint_path)

config = {
    "env": "CartPole-v0",
    "num_gpus": 0,
    "num_workers": 1,
    "framework": "tf2",
}

agent = ppo.PPOTrainer(config=config, env="CartPole-v0")
agent.restore(checkpoint_path)

print("agent=", agent)

############## TYPICAL GYM ENV ##########################
import gym
env = gym.make("CartPole-v0")

episode_reward = 0
done = False
obs = env.reset()
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    episode_reward += reward
print("episode_reward=", episode_reward)
```

This path works for me:
/home/peterpirog/PycharmProjects/Ray_tests/ray_results/PPO/PPO_BipedalWalkerHardcore-v3_eba13_00000_0_2021-01-10_16-43-09/checkpoint_1326

Maybe check if your paths are correct.

Hello Peter, thanks for your advice. I am using Ray Tune to tune the number of hidden layers of an MLP network. Here is my script to tune the model; could you please check it?

GitHub - LiuDaniu1997/tune-with-ray

Is it possible to tune the number of hidden layers with PBT or any other method in Ray?
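For context, the trial names in the errors above show how my search space encodes the architecture (num_layers, units_0, af_0..af_7, af_output). Here is a rough sketch of how such a config could map to per-layer specs; the activation list and the 32-unit default are assumptions for illustration, not my exact code:

```python
def mlp_layer_spec(config):
    """Turn sampled hyperparameters into a list of (units, activation) pairs.

    Assumes each af_* value indexes into ACTIVATIONS, and that units_i may
    be missing for deeper layers (the 32-unit default is an assumption).
    """
    ACTIVATIONS = ["relu", "tanh", "sigmoid", "linear"]  # assumed index mapping
    spec = []
    for i in range(config["num_layers"]):
        units = config.get("units_%d" % i, 32)
        spec.append((units, ACTIVATIONS[config["af_%d" % i]]))
    spec.append((1, ACTIVATIONS[config["af_output"]]))  # output layer
    return spec

# each (units, activation) pair can then become e.g. a keras.layers.Dense layer
```

Because the layer count comes from the config, the same builder works for any sampled depth, which is what makes tuning the number of layers possible in principle.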