Hi! I'm having some issues getting PPO to work on my custom environment. I observe that the value estimate at the goal state is negative (while it is correctly positive and large at the preceding state) and suspect this is related to my learning problems: the policy typically gets stuck in a local optimum.
MDP:
The agent receives -0.05 reward per timestep and +0.5 for reaching the goal; after 15 timesteps the episode is truncated. The discrete action space is 10-dimensional, and the optimal action sequence is 1 to 8 steps long, depending on the random initial state.
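In case it helps, here's a minimal Gymnasium-style skeleton of this setup. The class name, observation space, and dynamics are placeholders (my real environment does more), but the reward and truncation logic are the same:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class GoalEnvSketch(gym.Env):
    """Skeleton of the MDP described above; only the reward/truncation
    logic matches my real environment, the rest is a placeholder."""

    MAX_STEPS = 15  # episode is truncated after 15 timesteps

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(10)  # 10 discrete actions
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self._steps = 0
        self._state = np.zeros(4, dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._steps = 0
        self._state = self.observation_space.sample()  # random initial state
        return self._state, {}

    def step(self, action):
        self._steps += 1
        # Placeholder transition and goal check -- the real env does more here.
        reached_goal = False

        terminated = reached_goal                        # reaching the goal ends the episode
        truncated = self._steps >= self.MAX_STEPS        # time limit -> truncation, not termination
        reward = -0.05 + (0.5 if reached_goal else 0.0)  # per-step penalty plus goal bonus
        return self._state, reward, terminated, truncated, {}
```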
Concrete Question:
If I understand correctly, PPO learns from experience tuples (S, A, R, S'), using bootstrapping for the critic. Because of this bootstrapping, a good value estimate for the goal state should be important. But how is the critic ever supposed to learn a correct value estimate of the goal state if this is where the episode always ends? That is, there is no next reward to learn from!
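To make sure I'm describing the same thing everyone else means, here is roughly how I picture the critic targets being built from those tuples. This is a generic one-step TD sketch with my own variable names, not code from any specific PPO implementation:

```python
import numpy as np


def td_targets(rewards, next_values, terminated, gamma=0.99):
    """One-step bootstrapped critic targets as I understand them
    (a generic sketch, not code from a particular PPO library)."""
    not_done = 1.0 - terminated.astype(np.float32)
    # At a terminal transition the bootstrap term is masked out,
    # so the target is just the immediate reward.
    return rewards + gamma * not_done * next_values


# Toy 3-step episode that ends at the goal:
rewards     = np.array([-0.05, -0.05, 0.5], dtype=np.float32)
next_values = np.array([ 0.4,   0.6, -0.1], dtype=np.float32)  # last entry is the critic's V(goal)
terminated  = np.array([False, False, True])
print(td_targets(rewards, next_values, terminated))
```

In this picture, the critic's V(goal) only ever shows up as the masked-out bootstrap term of the final transition, and the goal state never appears as S in any tuple, so it never gets a regression target of its own. That's exactly what I don't understand: does its value even matter, or is my negative estimate there harmless?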