I trained a model in an environment using PPO.
I then restored the model both as an `algo` and as a `policy` from the same checkpoint.
The (average) episode rewards of the two were quite different, which was unexpected.
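For reference, this is roughly how I restored the two objects (a minimal sketch; the checkpoint path is a placeholder, and in my Ray/RLlib version `Policy.from_checkpoint()` may return a dict of policies when given an algorithm checkpoint):

```python
from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.policy.policy import Policy

CHECKPOINT = "/path/to/my/ppo_checkpoint"  # placeholder path

# Restore the full Algorithm object from the checkpoint.
algo = Algorithm.from_checkpoint(CHECKPOINT)

# Restore only the policy from the same checkpoint.
# For an algorithm checkpoint this may return a dict {policy_id: Policy}.
restored = Policy.from_checkpoint(CHECKPOINT)
policy = restored["default_policy"] if isinstance(restored, dict) else restored
```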
I checked the actions they compute for the same observations using `compute_single_action(obs)`. However, the actions were also significantly different, even when I set `explore=False`.
The model weights were the same in the `algo` and the `policy`. I expected them to behave the same when using `compute_single_action(obs, explore=False)`. Did I miss something?
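Here is a sketch of how I compared the actions and the weights. The environment name is a placeholder for my custom env, the code assumes the gymnasium reset/step API, and it assumes `get_weights()` returns a dict of numpy arrays. Note that `Policy.compute_single_action()` returns a tuple `(action, state_outs, extra_fetches)` while `Algorithm.compute_single_action()` returns just the action by default, so I unpack accordingly:

```python
import gymnasium as gym
import numpy as np

env = gym.make("MyEnv-v0")  # placeholder for my custom environment
obs, _ = env.reset()

# Deterministic action from the restored Algorithm.
a_algo = algo.compute_single_action(obs, explore=False)

# Deterministic action from the restored Policy.
# Policy.compute_single_action() returns (action, state_outs, extra_fetches).
a_policy, _, _ = policy.compute_single_action(obs, explore=False)

print("algo action:  ", a_algo)
print("policy action:", a_policy)

# Check that the weights are identical in both objects.
w_algo = algo.get_policy().get_weights()
w_policy = policy.get_weights()
same = all(np.allclose(w_algo[k], w_policy[k]) for k in w_algo)
print("weights identical:", same)
```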
One theory would be that `compute_single_action()` works a bit differently depending on which object it is called on, `algo` vs `policy`. I suspect this because another policy obtained from `algo.get_policy()` showed almost identical average performance to the `policy` restored from the checkpoint, rather than to the `algo`…
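This is roughly how I compared the policy from `algo.get_policy()` with the one restored from the checkpoint (a simplified rollout loop; the episode count is arbitrary and it again assumes the gymnasium API):

```python
def avg_episode_reward(pol, env, n_episodes=20):
    """Roll out a Policy deterministically and return its mean episode reward."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            action, _, _ = pol.compute_single_action(obs, explore=False)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_ret += reward
            done = terminated or truncated
        returns.append(ep_ret)
    return sum(returns) / len(returns)

# Policy living inside the restored Algorithm vs. policy restored directly.
policy_from_algo = algo.get_policy()
print("policy from algo.get_policy():", avg_episode_reward(policy_from_algo, env))
print("policy from checkpoint:       ", avg_episode_reward(policy, env))
```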
I did the same thing with a default PPO setup on `CartPole-v1`, but there the average behavior was almost identical… which makes me even more confused…
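For completeness, this is roughly the `CartPole-v1` check I ran (a rough sketch; the number of training iterations is arbitrary, and the exact return type of `algo.save()` differs between Ray versions):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Default PPO on CartPole-v1, trained for a handful of iterations.
config = PPOConfig().environment("CartPole-v1")
algo = config.build()
for _ in range(10):
    algo.train()

# Save a checkpoint, then restore it as an Algorithm and as a Policy
# exactly as in the snippets above and compare their average rewards.
checkpoint = algo.save()  # may be a path string or a checkpoint object, depending on Ray version
```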