PPO: Value estimate off for goal state

Hi! I am having trouble getting PPO to run on my custom environment. I observe that the value estimate at the goal state is negative (while it is correctly positive and large at the preceding state), and I suspect this is related to the learning issues: typically, the policy gets stuck in a local optimum.

MDP:
The agent receives a reward of -0.05 per timestep and +0.5 for reaching the goal; the episode is truncated after 15 timesteps. The discrete action space is 10-dimensional, and the optimal action sequence is 1 to 8 steps long, depending on the random initial state.
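If it helps, here is a toy stand-in with the same reward/termination structure (Gymnasium-style; the dynamics and observation space are simplified placeholders, not my real environment):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CustomEnvSketch(gym.Env):
    """Toy stand-in for my environment; only the reward/termination structure matters."""

    def __init__(self):
        self.action_space = spaces.Discrete(10)           # 10 discrete actions (my reading of "10-dimensional")
        self.observation_space = spaces.Box(0.0, 8.0, shape=(1,), dtype=np.float32)
        self._pos = 0
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._pos = int(self.np_random.integers(1, 9))    # random start: 1-8 steps from the goal
        self._t = 0
        return np.array([self._pos], dtype=np.float32), {}

    def step(self, action):
        self._t += 1
        if action == 0:                                    # placeholder dynamics:
            self._pos -= 1                                 # action 0 moves one step toward the goal
        terminated = self._pos == 0                        # goal reached
        truncated = (not terminated) and self._t >= 15     # time limit: 15 timesteps
        reward = -0.05 + (0.5 if terminated else 0.0)      # -0.05 per step, +0.5 on reaching the goal
        return np.array([self._pos], dtype=np.float32), reward, terminated, truncated, {}
```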

Concrete Question:
If I understand correctly, PPO learns from experience tuples (S, A, R, S'), using bootstrapping for the critic. Because of this bootstrapping, a good value estimate for the goal state should be important. But how is the critic ever supposed to learn a correct value estimate for the goal state if that is where the episode always ends, i.e. there is no next reward to learn from?
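To make sure I have the mechanism right, this is my understanding of the one-step bootstrapped target a typical PPO critic is regressed towards (a sketch of a generic implementation, not the code of any specific library; the values in the example are made up):

```python
import numpy as np


def td_targets(rewards, next_values, terminated, gamma=0.99):
    """One-step bootstrapped critic targets: R + gamma * V(S') * (1 - terminated).

    All arguments are 1-D arrays over a rollout. 'terminated' is 1.0 where the
    episode actually ended (goal reached) and 0.0 otherwise -- as I understand it,
    implementations usually still bootstrap from V(S') at a time-limit truncation.
    """
    return rewards + gamma * next_values * (1.0 - terminated)


# Tiny example with the numbers from my MDP: the last transition reaches the goal.
rewards     = np.array([-0.05, -0.05, 0.45])
next_values = np.array([0.30, 0.45, 0.00])   # critic's V(S'); on the terminal step it gets masked out
terminated  = np.array([0.0, 0.0, 1.0])
print(td_targets(rewards, next_values, terminated))
```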