If I want to continue training my algorithm after saving it to a checkpoint, how can I do that?
My idea is shown below, but it does not work as written: an Algorithm instance (algo) cannot be passed directly to tune.Tuner:
algo = config.build()
algo.restore(user_checkpoint_dir)
results = tune.Tuner(
    algo,
    param_space=config,
    run_config=air.RunConfig(
        stop=stop,
        verbose=2,
        checkpoint_config=air.CheckpointConfig(checkpoint_at_end=True),
    ),
).fit()
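One alternative I can think of, but have not verified, is to wrap the restore call and the training loop in a plain function trainable, so that Tuner only sees a callable (rough sketch, assuming Ray 2.x; config, user_checkpoint_dir and stop are the same objects as above):

from ray import air, tune
from ray.air import session

def continue_training(_tune_config):
    # build a fresh Algorithm, load the old checkpoint, then keep training
    algo = config.build()
    algo.restore(user_checkpoint_dir)
    while True:
        result = algo.train()
        session.report(result)  # hand the metrics back to Tune every iteration

results = tune.Tuner(
    continue_training,
    run_config=air.RunConfig(stop=stop, verbose=2),
).fit()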
Hi,
This issue has been raised multiple times and it seems there is no clear solution.
I have been trying to link all the GitHub issues related to this in my own issue here: Fails restoring weights · Issue #41508 · ray-project/ray · GitHub.
Basically, there is a way to resume a failed tuning run with Tuner.restore,
but you cannot pass a new configuration, so it is useless for this.
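For reference, resuming that way looks roughly like this (a minimal sketch, assuming Ray 2.x; the experiment path and trainable name are placeholders, and the original configuration is reused as-is):

from ray import tune

tuner = tune.Tuner.restore(
    path="~/ray_results/my_experiment",  # directory of the interrupted run (placeholder)
    trainable="PPO",                     # must match the trainable used in the original run
    resume_errored=True,                 # also retry trials that errored out
)
results = tuner.fit()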
ray/rllib/examples/restore_1_of_n_agents_from_checkpoint.py at master · ray-project/ray · GitHub shows a way to restore the weights of your policy, but I have found it to be failing.
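As far as I understand, the gist of that example is to load only the policy weights from an Algorithm checkpoint and copy them into a freshly built Algorithm (sketch only; the checkpoint path is a placeholder and config stands for your AlgorithmConfig):

from ray.rllib.policy.policy import Policy

# returns a dict mapping policy ids to restored Policy objects
restored_policies = Policy.from_checkpoint("/path/to/checkpoint")
new_algo = config.build()  # freshly built Algorithm from the same config
new_algo.set_weights(
    {"default_policy": restored_policies["default_policy"].get_weights()}
)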
Any working example of how to do this would be welcome, as many people seem to have this problem.
Thanks for your reply, I will keep looking for a way to solve this problem. I think the issue lies in the Tune startup process, but I don't currently have the time or ability to go through the code bit by bit. I will get back to you if there is any progress.