[RLlib] How to restore trainer from different checkpoint files when training on server and local

Mirakolix_Gallier · January 31, 2023, 5:31pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I trained a PPO agent on a Linux server with trainer.train() and then saved the progress with .save(). My problem is that even when I run the exact same code on my local machine and on the server, I get different files in the checkpoint directory. I would like to restore the server checkpoint on my local machine, but this gives me the error: “[Errno 2] No such file or directory: ‘/Users/…/checkpoint_000200.tune_metadata’”. There is in fact no tune metadata file in the server checkpoint path, whereas I do get a tune metadata file when running the exact same code locally.

Could someone explain to me why the checkpoints are different and how I could restore my trainer locally from the server checkpoints? Thanks!

A: Example Server Checkpoints

B: Example Locally trained checkpoints:

Mirakolix_Gallier · February 3, 2023, 4:59pm

So, the problem was that I used Ray 2.2 on the server, which produces the checkpoint files shown in A), while I used Ray 1.13 locally, which produces the files shown in B). I could simply create a new environment with Ray 2.2 on the local machine to restore the checkpoints from the server.
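
For anyone hitting the same error, here is a minimal sketch of how one might check for this kind of mismatch before calling restore. It uses only the Python standard library and does not call any Ray API. The helper names `installed_ray_version` and `has_tune_metadata` and the example path are hypothetical. The check assumes, based only on the error message above, that the older restore code looks for a file named after the checkpoint path with `.tune_metadata` appended.

```python
from importlib import metadata
from pathlib import Path
from typing import Optional


def installed_ray_version() -> Optional[str]:
    """Return the installed Ray version string, or None if Ray is not installed."""
    try:
        return metadata.version("ray")
    except metadata.PackageNotFoundError:
        return None


def has_tune_metadata(checkpoint_path: str) -> bool:
    """Return True if '<checkpoint_path>.tune_metadata' exists.

    Assumption: this mirrors the file named in the error message;
    it is not taken from Ray's source code.
    """
    return Path(str(checkpoint_path) + ".tune_metadata").is_file()


if __name__ == "__main__":
    # Hypothetical path; replace with the checkpoint you are trying to restore.
    checkpoint = "path/to/checkpoint_000200"
    print("Ray version in this environment:", installed_ray_version())
    print("tune_metadata file present:", has_tune_metadata(checkpoint))
```

Running this in both the server and the local environment shows whether the two Ray versions differ. If they do, installing the server's Ray version in a fresh local environment is what resolved it in my case.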
