I trained a model in an environment using PPO.
I then restored the model both as an `algo` and as a `policy` from the same checkpoint.
The (average) episode rewards of the two were quite different, which was unexpected.
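For reference, this is roughly how I restored the two objects (a minimal sketch; the checkpoint path is a placeholder, and in my Ray/RLlib version `Policy.from_checkpoint()` may return a dict of policies when given an algorithm checkpoint):

```python
from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.policy.policy import Policy

CHECKPOINT = "/path/to/my/ppo_checkpoint"  # placeholder path

# Restore the full Algorithm object from the checkpoint.
algo = Algorithm.from_checkpoint(CHECKPOINT)

# Restore only the policy from the same checkpoint.
# For an algorithm checkpoint this may return a dict {policy_id: Policy}.
restored = Policy.from_checkpoint(CHECKPOINT)
policy = restored["default_policy"] if isinstance(restored, dict) else restored
```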
I checked the actions they compute for the same observations using `compute_single_action(obs)`. However, the actions were also significantly different, even when I set `explore=False`.
The model weights were the same in the `algo` and the `policy`. I expected them to behave the same when using `compute_single_action(obs, explore=False)`. Did I miss something?
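Here is a sketch of how I compared the actions and the weights. The environment name is a placeholder for my custom env, the code assumes the gymnasium reset/step API, and it assumes `get_weights()` returns a dict of numpy arrays. Note that `Policy.compute_single_action()` returns a tuple `(action, state_outs, extra_fetches)` while `Algorithm.compute_single_action()` returns just the action by default, so I unpack accordingly:

```python
import gymnasium as gym
import numpy as np

env = gym.make("MyEnv-v0")  # placeholder for my custom environment
obs, _ = env.reset()

# Deterministic action from the restored Algorithm.
a_algo = algo.compute_single_action(obs, explore=False)

# Deterministic action from the restored Policy.
# Policy.compute_single_action() returns (action, state_outs, extra_fetches).
a_policy, _, _ = policy.compute_single_action(obs, explore=False)

print("algo action:  ", a_algo)
print("policy action:", a_policy)

# Check that the weights are identical in both objects.
w_algo = algo.get_policy().get_weights()
w_policy = policy.get_weights()
same = all(np.allclose(w_algo[k], w_policy[k]) for k in w_algo)
print("weights identical:", same)
```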
One theory would be that `compute_single_action()` works a bit differently depending on which object it is called on, `algo` vs `policy`. I suspect this because another policy obtained from `algo.get_policy()` showed almost identical average performance to the `policy` restored from the checkpoint, rather than to the `algo`…
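This is roughly how I compared the policy from `algo.get_policy()` with the one restored from the checkpoint (a simplified rollout loop; the episode count is arbitrary and it again assumes the gymnasium API):

```python
def avg_episode_reward(pol, env, n_episodes=20):
    """Roll out a Policy deterministically and return its mean episode reward."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            action, _, _ = pol.compute_single_action(obs, explore=False)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_ret += reward
            done = terminated or truncated
        returns.append(ep_ret)
    return sum(returns) / len(returns)

# Policy living inside the restored Algorithm vs. policy restored directly.
policy_from_algo = algo.get_policy()
print("policy from algo.get_policy():", avg_episode_reward(policy_from_algo, env))
print("policy from checkpoint:       ", avg_episode_reward(policy, env))
```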
I did the same thing with a default PPO setup on `CartPole-v1`, but there the average behavior was almost identical… which makes me even more confused…
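For completeness, this is roughly the `CartPole-v1` check I ran (a rough sketch; the number of training iterations is arbitrary, and the exact return type of `algo.save()` differs between Ray versions):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Default PPO on CartPole-v1, trained for a handful of iterations.
config = PPOConfig().environment("CartPole-v1")
algo = config.build()
for _ in range(10):
    algo.train()

# Save a checkpoint, then restore it as an Algorithm and as a Policy
# exactly as in the snippets above and compare their average rewards.
checkpoint = algo.save()  # may be a path string or a checkpoint object, depending on Ray version
```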