I trained a model in an environment using PPO. I then restored the model from the same checkpoint twice: once as an `algo` and once as a `policy`.
The (average) episode rewards of the two were quite different, which was unexpected.
I also compared the actions they computed for the same observations using `compute_single_action()`. However, the actions were significantly different too, even when I set `explore=False`. The weights of the model were the same in the `algo` and the `policy`, so I expected the two to behave the same when using `compute_single_action()`.
Did I miss something?
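Roughly, the comparison I ran looks like the sketch below. In the real code the two actors were the restored `algo` and `policy` (e.g. via RLlib's `Algorithm.from_checkpoint()` and `Policy.from_checkpoint()`), each queried with `compute_single_action(obs, explore=False)`. The two linear actors here are hypothetical stand-ins so the snippet runs without Ray:

```python
import numpy as np

def agreement_rate(act_a, act_b, observations):
    """Fraction of observations on which the two actors pick the same action."""
    matches = [np.array_equal(act_a(o), act_b(o)) for o in observations]
    return sum(matches) / len(matches)

rng = np.random.default_rng(0)
obs_batch = rng.normal(size=(100, 4))  # CartPole-like 4-dim observations

# Stand-ins for the two restored objects: identical "weights" w, but the
# second one acts on a shifted observation, mimicking a hidden difference
# in how the observation is handled before the forward pass.
w = rng.normal(size=4)
actor_raw = lambda o: int(o @ w > 0)            # acts on the raw obs
actor_shifted = lambda o: int((o - 0.5) @ w > 0)  # acts on a shifted obs

print(agreement_rate(actor_raw, actor_raw, obs_batch))     # identical actors -> 1.0
print(agreement_rate(actor_raw, actor_shifted, obs_batch))
```

Even with identical weights, any difference in observation handling between the two call paths shows up as an agreement rate below 1.0, which is the kind of mismatch I am seeing.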
One theory would be that `compute_single_action()` works a bit differently depending on the object it is called on, i.e., the `algo` or the `policy`. I suspect this because another policy obtained via `algo.get_policy()` showed almost identical average performance to the `policy` restored from the checkpoint, rather than to the `algo`.
I did the same thing for a default PPO on `CartPole-v1`, but there the average behavior was almost identical… which makes me even more confused…
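For reference, this is roughly how I verified that the weights matched: I pulled the weight dicts from both objects (in RLlib, `get_weights()` on a `Policy` returns a dict of layer name to array) and compared them key-by-key. A minimal, Ray-free sketch with dummy weight dicts standing in for the real ones:

```python
import numpy as np

def weights_match(w1, w2, atol=1e-8):
    """True if two weight dicts (layer name -> ndarray) agree key-by-key."""
    if set(w1) != set(w2):
        return False
    return all(np.allclose(w1[k], w2[k], atol=atol) for k in w1)

# Dummy stand-ins for algo-side and policy-side weight dicts:
algo_weights = {"fc/kernel": np.ones((4, 2)), "fc/bias": np.zeros(2)}
policy_weights = {"fc/kernel": np.ones((4, 2)), "fc/bias": np.zeros(2)}
print(weights_match(algo_weights, policy_weights))  # True
```

This check passed for my checkpoint, which is why the differing actions surprise me.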