Restore checkpoint saved with client-server

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’ve run a large scale training run using client-server with an external environment. I’m a little old-school, so I just created the trainer directly and saved checkpoints in the train loop instead of using tune:

    import ray
    from ray.rllib.agents.a2c import A2CTrainer

    ray.init()
    trainer = A2CTrainer(config=config)

    # Serving and training loop.
    i = 0
    while True:
        results = trainer.train()
        trainer.export_policy_checkpoint(
            f"checkpoint_directory/checkpoint{i}",
            policy_id="3",
        )
        i += 1

Each checkpoint directory looks like this:

ls -al
total 2008
drwx--S---  2 rusu1   4096 Jul 26 20:37 ./
drwx--S--- 48 rusu1 249856 Jul 26 20:37 ../
-rw-------  1 rusu1    231 Jul 26 20:37 checkpoint
-rw-------  1 rusu1 814172 Jul 26 20:37 model.data-00000-of-00001
-rw-------  1 rusu1   1084 Jul 26 20:37 model.index
-rw-------  1 rusu1 963545 Jul 26 20:37 model.meta

This ran for about 12 hours, and I am very pleased with the training results. Now I want to do some post-processing, including running a few episodes so I can visualize what the agent learned. The training ran overnight and the compute nodes I had open have since been released, so I no longer have the server process running. That shouldn’t be an issue, since I have the checkpoints saved to disk. Following this inference example, I have the following in my post-processing script:

    import ray
    from ray.rllib.agents.a2c import A2CTrainer

    sim = MyExternalEnvSim()

    ray.init()
    trainer = A2CTrainer(config=config)

    checkpoint = 'checkpoint_directory/checkpoint_45/model'
    trainer.restore(checkpoint)

    # Inference loop below
    ...

The external sim launches fine and the trainer is created without issue. However, when trainer.restore() is called, I get the following error:

Traceback (most recent call last):
  File "my_script.py", line 127, in <module>
    trainer.restore(checkpoint)
  File "/venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 515, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoint_directory/checkpoint_45/model.tune_metadata'

So it seems that restore automatically looks for a .tune_metadata file alongside the checkpoint. However, I didn’t run with tune, so I don’t have one. Is there a way to load the checkpoint files in the format I have them in?

Hi @rusu24edward,

There should be a trainer.load_checkpoint method that may work for you.
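
If it helps, here is a rough sketch of how I would expect that pair of methods to be used. I have not checked this against your exact Ray version, so treat the directory name as a placeholder:

    # save_checkpoint writes a pickled state file into the given directory
    # and returns its path; load_checkpoint reads that same file back.
    path = trainer.save_checkpoint("some_checkpoint_dir")
    trainer.load_checkpoint(path)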

Thanks for the suggestion. It looks like load_checkpoint attempts to read a pickled state file such as would be produced from something like trainer.save_checkpoint(). None of my files are pickled state files. They appear to be the output of tf1.train.Saver().save().

I dug around a little more and it looks like there is an import_policy_model_from_h5 function that may be the complement of export_policy_checkpoint. I’ll see if I can make any progress with that…
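
Since the files look like plain tf1.train.Saver output, another idea I may try is reading the variables back with TensorFlow’s checkpoint reader and seeing whether they can be mapped onto the freshly built policy. This is only a sketch of the idea, not something I have confirmed works; in particular, the saver’s variable names probably won’t line up with the keys that the policy’s get_weights() returns:

    import tensorflow as tf

    # Read the raw variables out of the tf1.train.Saver checkpoint.
    reader = tf.train.load_checkpoint("checkpoint_directory/checkpoint_45/model")
    raw_weights = {
        name: reader.get_tensor(name)
        for name in reader.get_variable_to_shape_map()
    }

    # Compare naming schemes before attempting policy.set_weights(...);
    # some manual remapping of the keys is almost certainly needed.
    policy = trainer.get_policy("3")
    print(sorted(raw_weights.keys()))
    print(sorted(policy.get_weights().keys()))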

You know, it is really easy to run with tune. Based on your small snippet, it would be something like:

from ray import tune
tune.run("A2C", 
    config=config, 
    checkpoint_freq=1,
    keep_checkpoints_num=N,
    checkpoint_at_end=True,
    stop={  # "timesteps_total": NUMBER_OF_TIMESTEPS_TO_SAMPLE,
        "training_iteration": NUMBER_OF_TRAINING_ITERATIONS,
    },
    local_dir=PATH_TO_WHERE_YOU_WANT_YOUR_RESULTS,
)
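
And restoring later is just a matter of pointing restore at one of the checkpoints Tune wrote out. The path below is only illustrative; the actual layout under local_dir depends on the trial name and Ray version:

    trainer = A2CTrainer(config=config)
    # Illustrative path only: look inside local_dir for the actual
    # checkpoint_XXX/checkpoint-XXX file that Tune produced.
    trainer.restore(
        "PATH_TO_WHERE_YOU_WANT_YOUR_RESULTS/A2C/<trial_name>/checkpoint_000010/checkpoint-10"
    )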

I usually do run with tune, but this is a new effort and I went with the Python API instead. I don’t want to redo my 12-hour run, so if there’s a way to load the checkpoint, that would be best.

For sure, for future runs I’ll switch to tune for ease of use.

Sorry for over-explaining, then. I misinterpreted that quote to mean you had not run with tune before.

All good! I appreciate your suggestions

I wasn’t able to find a way to import the policy, so I had to rerun my training from scratch. I definitely used tune this time. I’m not sure how I got it into my head to use export_policy_checkpoint. It would be nice if every export/save method came with a matching import/load method. I know that most do, but I just wasn’t able to find anything for export_policy_checkpoint.
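
For anyone who finds this thread later: as far as I can tell, the piece I was missing is that trainer.save() in a manual loop goes through the same Trainable checkpointing path that tune uses (it writes the .tune_metadata file alongside the checkpoint), so its output is what trainer.restore() expects. A rough sketch of what I’ll do if I ever skip tune again (directory name and iteration count are placeholders):

    from ray.rllib.agents.a2c import A2CTrainer

    trainer = A2CTrainer(config=config)
    for i in range(NUM_TRAINING_ITERATIONS):
        results = trainer.train()
        # save() returns the path to a checkpoint that restore() accepts.
        checkpoint_path = trainer.save("checkpoint_directory")

    # Later, e.g. in a post-processing script:
    trainer = A2CTrainer(config=config)
    trainer.restore(checkpoint_path)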