How to create checkpoints

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

#!/usr/bin/env python3
from ray.rllib.agents import dqn
from env import AnimalTower, udid_list
from ray.tune.registry import register_env

def env_creator(env_config):
    env = AnimalTower()
    return env

register_env("my_env", env_creator)

trainer = dqn.R2D2Trainer(env="my_env", config={
    "framework": "tf",
    # R2D2 settings.
    "num_workers": 3,
    "compress_observations": True,
    "exploration_config": {
        "epsilon_timesteps": 40
    },
    "target_network_update_freq": 10,
    "model": {
        "use_lstm": True
    },
    "timesteps_per_iteration": 1
})

for i in range(10000):
    print(trainer.train())
    if i % 100 == 0:
        checkpoint = trainer.save()
        print("checkpoint saved at", checkpoint)

Training does not finish, and no checkpoints are created in ~/ray_results.
Why?

Hi @kuu-dtb-rl!

I’m quite new to working with RLlib, so I hope I understood your issue correctly. I could not reproduce your error because you did not provide your custom environment. However, I suggest you use ray.tune.run() for training instead of dqn.R2D2Trainer(), like:

import ray.tune

# Assumes 'my_env' has already been registered via register_env().
ray.tune.run(
    'R2D2',
    stop={
        'training_iteration': 10000,
    },
    config={
        'env': 'my_env',
        'framework': 'tf',
        # R2D2 settings.
        'num_workers': 3,
        'compress_observations': True,
        'exploration_config': {'epsilon_timesteps': 40},
        'target_network_update_freq': 10,
        'model': {'use_lstm': True},
        'timesteps_per_iteration': 1
    },
    checkpoint_freq=100,
    checkpoint_at_end=True,
    local_dir='checkpoints',
)

The checkpoints should then be created under ./checkpoints rather than ~/ray_results, since local_dir overrides the default results directory.

Again, I could not test this with your custom env, but hopefully it helps.
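
If you later want to continue training from one of those checkpoints, something like the following should work (an untested sketch; the checkpoint path is hypothetical and depends on the actual trial directory Tune creates):

from ray.rllib.agents import dqn
from ray.tune.registry import register_env
from env import AnimalTower

# Re-register the env and rebuild the trainer with the same config
# that produced the checkpoint (abbreviated here), then load its state.
register_env('my_env', lambda env_config: AnimalTower())
trainer = dqn.R2D2Trainer(env='my_env', config={
    'framework': 'tf',
    'model': {'use_lstm': True},
})
trainer.restore('checkpoints/R2D2/checkpoint_000100/checkpoint-100')  # hypothetical path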


I believe trainer.save() already writes the checkpoint to disk and returns its path as a string; if you want to keep track of where the checkpoint landed, you can write that path out explicitly:

checkpoint = trainer.save()
# some_checkpoint_path is a text file of your choosing for recording the path.
with open(some_checkpoint_path, "w") as fp:
    fp.write(checkpoint)
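
Later you can read that path back and resume from the checkpoint (a sketch under the same assumption, i.e. some_checkpoint_path is the file written above):

# Read the recorded checkpoint path and restore the trainer's state from it.
with open(some_checkpoint_path) as fp:
    checkpoint_path = fp.read()
trainer.restore(checkpoint_path)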