How to restore a trained agent to further train it?

cl_tch · July 29, 2021, 5:43pm

I have tried to use agent.restore(“path to checkpoint”) and then continue training with tune.run(self.agent,…hyperparameters here). I do get a message logged to the console telling me that my agent has been restored, but tune.run seems to create a new directory (i.e. a new experiment with its own metadata) and then runs in a perpetual loop, periodically printing info statements about the task that essentially say it is still pending.

This happens even when I add something like 25 episodes (which should take at most 4 minutes to train): it just keeps printing the info statement above without doing any actual training. Furthermore, the new directory that is created by running tune.run again with the trained agent passed in is completely empty.

Can someone point me in the right direction as to how I can do both of the following?

1. Properly restore a “fully” trained agent (i.e. one that completed its training loop previously).
2. Continue training this “fully” trained agent for some more training iterations and update only the directory and metadata of this fully trained agent rather than creating a completely new directory.

Thanks so much. For reference, I have looked at this and this, but neither seems to train the agent further correctly, and both result in the situation above.

cl_tch · July 29, 2021, 5:49pm

This is the message that shows up when I call my load() function which restores the agent:

2021-07-28 12:36:52,082	INFO trainable.py:378 -- Restored on 10.0.0.37 from checkpoint:
....checkpoint_000002\checkpoint-2
2021-07-28 12:36:52,086	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 817.6041111946106, '_episodes_total': 55}

Lars_Simon_Zehnder · July 29, 2021, 5:51pm

Hi @cl_tch ,

maybe this helps you with your problem.

Simon

cl_tch · July 31, 2021, 8:24pm

Will reimporting the weights to the agent and then running it with tune.run(agent, …) work? I am currently using tune.run() in my project with an episodes_total stop condition so I’d rather continue using tune instead of the Python API.
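For comparison, the Python API route mentioned above could look roughly like the sketch below. It assumes a PPO trainer from the Ray 1.x module layout and a placeholder CartPole-v0 environment, neither of which is stated in this thread; train_until and continue_training are hypothetical helper names. The one detail taken from the log above is that the restored state already counts 55 episodes, so "25 more episodes" has to be expressed as a cumulative target of 80.

```python
def train_until(trainer, target_episodes_total):
    """Call trainer.train() until the cumulative episode count reaches the target.

    `trainer` is any object whose train() returns a result dict that contains
    the key "episodes_total" (RLlib trainers report this key).
    """
    result = None
    # Always train at least once, then keep going until the target is reached.
    while result is None or result["episodes_total"] < target_episodes_total:
        result = trainer.train()
    return result


def continue_training(checkpoint_path, episodes_so_far, extra_episodes):
    """Hypothetical helper: restore a trainer and train it for more episodes."""
    # Imported lazily so that train_until can be used without Ray installed.
    import ray
    from ray.rllib.agents.ppo import PPOTrainer  # assumption: PPO, Ray 1.x layout

    ray.init(ignore_reinit_error=True)
    # Assumption: placeholder environment; use the env and config of the
    # original training run so the checkpoint matches the model.
    trainer = PPOTrainer(env="CartPole-v0")
    trainer.restore(checkpoint_path)
    # episodes_total is cumulative, so the target includes earlier episodes.
    train_until(trainer, episodes_so_far + extra_episodes)
    return trainer.save()  # path of the newly written checkpoint
```

With the numbers from the log, the call would be continue_training(“path to checkpoint”, episodes_so_far=55, extra_episodes=25).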

sven1977 · August 2, 2021, 3:01pm

There is a restore option for tune.run, which allows you to provide a checkpoint. tune.run will restore the created Trainer from this checkpoint and then “continue” training.

E.g.
ray.rllib.examples.unity3d_env_local.py has a --from_checkpoint option.
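A minimal sketch of such a call, assuming a Ray 1.x tune.run that accepts restore, stop and checkpoint_at_end, might look like this. The "PPO" algorithm name and the CartPole-v0 environment are placeholders that do not come from this thread, and build_tune_kwargs and resume_with_tune are hypothetical helper names. Two points are worth checking against your own setup: tune.run takes an algorithm name or trainer class rather than an already constructed agent instance, and an episodes_total stop condition is compared against the cumulative count, which after restoring already includes the earlier episodes.

```python
def build_tune_kwargs(checkpoint_path, episodes_so_far, extra_episodes):
    """Assemble keyword arguments for tune.run that continue from a checkpoint."""
    return {
        # Checkpoint to load before training continues.
        "restore": checkpoint_path,
        # Cumulative target: earlier episodes plus the additional ones.
        "stop": {"episodes_total": episodes_so_far + extra_episodes},
        # Write a checkpoint when the stop condition is met.
        "checkpoint_at_end": True,
    }


def resume_with_tune(checkpoint_path, episodes_so_far, extra_episodes):
    """Hypothetical helper: continue training a checkpointed trainer via Tune."""
    # Imported lazily so that build_tune_kwargs works without Ray installed.
    from ray import tune

    return tune.run(
        "PPO",  # assumption: algorithm name, not a trainer instance
        config={"env": "CartPole-v0"},  # assumption: placeholder config
        **build_tune_kwargs(checkpoint_path, episodes_so_far, extra_episodes),
    )
```

For a checkpoint whose restored state reports 55 episodes, resume_with_tune(“path to checkpoint”, 55, 25) would stop once episodes_total reaches 80. The config should match the one used for the original run.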

cl_tch · August 5, 2021, 12:35am

@sven1977

Thanks Sven, that works. Is there a way to set up the call to tune.run with the restore=“path to checkpoint” parameter so that the experiment actually resumes in the directory of the checkpoint, rather than creating a new directory with essentially the same data and storing the checkpoint in this new directory?

Essentially what I mean is: if I specify the restore path, can the further trained agent just overwrite the data within that restore path rather than creating a new directory and storing the further trained agent’s data and checkpoint there?
