Restore checkpoint saved with client-server

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’ve run a large scale training run using client-server with an external environment. I’m a little old-school, so I just created the trainer directly and saved checkpoints in the train loop instead of using tune:

    import ray
    from ray.rllib.agents.a2c import A2CTrainer

    ray.init()
    trainer = A2CTrainer(config=config)

    # Serving and training loop.
    i = 0
    while True:
        results = trainer.train()
        trainer.export_policy_checkpoint(
            f"checkpoint_directory/checkpoint{i}",
            policy_id="3",
        )
        i += 1

Each checkpoint directory looks like this:

ls -al
total 2008
drwx--S---  2 rusu1   4096 Jul 26 20:37 ./
drwx--S--- 48 rusu1 249856 Jul 26 20:37 ../
-rw-------  1 rusu1    231 Jul 26 20:37 checkpoint
-rw-------  1 rusu1 814172 Jul 26 20:37 model.data-00000-of-00001
-rw-------  1 rusu1   1084 Jul 26 20:37 model.index
-rw-------  1 rusu1 963545 Jul 26 20:37 model.meta

This ran for about 12 hours, and I am very pleased with the training results. Now I want to do some post-processing, including running a few episodes so I can visualize what the agent learned. The training ran overnight and the compute nodes I had open have since been released, so I no longer have the server process running. That shouldn’t be an issue, since I have the checkpoints saved to disk. Following this inference example, I have the following in my post-processing script:

    import ray
    from ray.rllib.agents.a2c import A2CTrainer

    sim = MyExternalEnvSim()

    ray.init()
    trainer = A2CTrainer(config=config)

    checkpoint = 'checkpoint_directory/checkpoint_45/model'
    trainer.restore(checkpoint)

    # Inference loop below
    ...

The external sim launches fine and the trainer is created without issue. However, when trainer.restore() is called, I get the following error:

Traceback (most recent call last):
  File "my_script.py", line 127, in <module>
    trainer.restore(checkpoint)
  File "/venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 515, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoint_directory/checkpoint_45/model.tune_metadata'

So it seems that restore automatically looks for a .tune_metadata file alongside the checkpoint. However, I didn’t run with tune, so I don’t have one. Is there a way to load the checkpoint files in the format I have them in?

Hi @rusu24edward,

There should be a trainer.load_checkpoint method that may work for you.
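
If it helps, here is a rough sketch of how I would expect that pair of methods to be used. I have not checked this against your exact Ray version, so treat the directory name as a placeholder:

    # save_checkpoint writes a pickled state file into the given directory
    # and returns its path; load_checkpoint reads that same file back.
    path = trainer.save_checkpoint("some_checkpoint_dir")
    trainer.load_checkpoint(path)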

Thanks for the suggestion. It looks like load_checkpoint attempts to read a pickled state file such as would be produced from something like trainer.save_checkpoint(). None of my files are pickled state files. They appear to be the output of tf1.train.Saver().save().

I dug around a little more and it looks like there is an import_policy_model_from_h5 function that may be the complement of export_policy_checkpoint. I’ll see if I can make any progress with that…
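
Since the files look like plain tf1.train.Saver output, another idea I may try is reading the variables back with TensorFlow’s checkpoint reader and seeing whether they can be mapped onto the freshly built policy. This is only a sketch of the idea, not something I have confirmed works; in particular, the saver’s variable names probably won’t line up with the keys that the policy’s get_weights() returns:

    import tensorflow as tf

    # Read the raw variables out of the tf1.train.Saver checkpoint.
    reader = tf.train.load_checkpoint("checkpoint_directory/checkpoint_45/model")
    raw_weights = {
        name: reader.get_tensor(name)
        for name in reader.get_variable_to_shape_map()
    }

    # Compare naming schemes before attempting policy.set_weights(...);
    # some manual remapping of the keys is almost certainly needed.
    policy = trainer.get_policy("3")
    print(sorted(raw_weights.keys()))
    print(sorted(policy.get_weights().keys()))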

You know, it is really easy to run with tune. Based on your small snippet, it would be something like:

from ray import tune
tune.run("A2C", 
    config=config, 
    checkpoint_freq=1,
    keep_checkpoints_num=N,
    checkpoint_at_end=True,
    stop={  # "timesteps_total": NUMBER_OF_TIMESTEPS_TO_SAMPLE,
        "training_iteration": NUMBER_OF_TRAINING_ITERATIONS,
    },
    local_dir=PATH_TO_WHERE_YOU_WANT_YOUR_RESULTS,
)
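
And restoring later is just a matter of pointing restore at one of the checkpoints Tune wrote out. The path below is only illustrative; the actual layout under local_dir depends on the trial name and Ray version:

    trainer = A2CTrainer(config=config)
    # Illustrative path only: look inside local_dir for the actual
    # checkpoint_XXX/checkpoint-XXX file that Tune produced.
    trainer.restore(
        "PATH_TO_WHERE_YOU_WANT_YOUR_RESULTS/A2C/<trial_name>/checkpoint_000010/checkpoint-10"
    )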

I usually do run with tune, but this is a new effort and I went with the Python API instead. I don’t want to redo my 12-hour run, so if there’s a way to load the checkpoint, that would be best.

For sure, for future runs I’ll switch to tune for ease of use.

Sorry for over-explaining, then. I misinterpreted that quote to mean you had not run with tune before.

All good! I appreciate your suggestions

I wasn’t able to find a way to import the policy, so I had to rerun my training from scratch. I definitely used tune this time. I’m not sure how I got it into my head to use export_policy_checkpoint. It would be nice if every export/save method came with a matching import/load method. I know that most do, but I just wasn’t able to find anything for export_policy_checkpoint.
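
For anyone who finds this thread later: as far as I can tell, the piece I was missing is that trainer.save() in a manual loop goes through the same Trainable checkpointing path that tune uses (it writes the .tune_metadata file alongside the checkpoint), so its output is what trainer.restore() expects. A rough sketch of what I’ll do if I ever skip tune again (directory name and iteration count are placeholders):

    from ray.rllib.agents.a2c import A2CTrainer

    trainer = A2CTrainer(config=config)
    for i in range(NUM_TRAINING_ITERATIONS):
        results = trainer.train()
        # save() returns the path to a checkpoint that restore() accepts.
        checkpoint_path = trainer.save("checkpoint_directory")

    # Later, e.g. in a post-processing script:
    trainer = A2CTrainer(config=config)
    trainer.restore(checkpoint_path)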