[RLlib] How to restore a trainer when the server and the local machine produce different checkpoint files

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I trained a PPO agent on a Linux server with trainer.train() and then stored the progress with .save(). My problem is that even though I run exactly the same code on my local machine and on the server, I get different files in the checkpoint directory. I would like to restore the server checkpoint on my local machine, but this gives me the error: “[Errno 2] No such file or directory: ‘/Users/…/checkpoint_000200.tune_metadata’”. There is indeed no .tune_metadata file in the server checkpoint path, whereas I do get a .tune_metadata file when running the same code locally.

Could someone explain to me why the checkpoints are different and how I can restore my trainer locally from the server checkpoints? Thanks!

A: Example server checkpoints:
image

B: Example locally trained checkpoints:
image

So, the problem was a version mismatch: I used Ray 2.2 on the server, which produces the checkpoint files shown in A), while I used Ray 1.13 locally, which produces the files shown in B). Creating a new environment with Ray 2.2 on the local machine let me restore the checkpoints from the server.
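
For anyone hitting the same error, a small guard like the one below could catch the mismatch before calling restore. This is only a sketch: the helper names major_minor and same_release_line are made up for illustration, and treating a matching major.minor release as "compatible" is my assumption, not a rule taken from the Ray docs. In practice you would pass ray.__version__ as the installed version and note down the version used on the server when saving.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.2.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release_line(saved_with: str, installed: str) -> bool:
    """Return True if both versions share the same major.minor release.

    Assumption: checkpoints are only safe to restore within one release line.
    """
    return major_minor(saved_with) == major_minor(installed)


if __name__ == "__main__":
    # Versions from this thread: the server used Ray 2.2, the local machine 1.13.
    # Hypothetical usage: replace the second argument with ray.__version__.
    print(same_release_line("2.2.0", "1.13.0"))  # False
    print(same_release_line("2.2.0", "2.2.0"))   # True
```

If the check returns False, installing the same Ray version as the one used for training (as I did with a fresh environment) is the simplest fix.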