PPO bug: States' values aren't counted if the next action terminates the episode?

I have a question that should be very straightforward, but it is causing me a surprising amount of difficulty, perhaps because I’ve missed something important. I’ve created a very simple environment consisting of two timesteps followed by a terminal reward, and I am trying to train PPO to predict values accurately on it with lambda_ = 0.

State A -action-> State B -action-> reward+termination
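
For reference, here is a minimal sketch of such an environment using the gymnasium API (the class name, the spaces, and the single dummy action are just illustrative choices, not my actual setup):

import gymnasium as gym


class TwoStepEnv(gym.Env):
    """State 0 = A, state 1 = B; one dummy action moves A -> B, then
    acting in B terminates the episode with reward 1.0."""

    def __init__(self, config=None):
        self.observation_space = gym.spaces.Discrete(2)
        self.action_space = gym.spaces.Discrete(1)
        self.state = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = 0
        return self.state, {}

    def step(self, action):
        if self.state == 0:
            # A -> B, no reward yet.
            self.state = 1
            return self.state, 0.0, False, False, {}
        # Acting in B ends the episode and yields the terminal reward.
        return self.state, 1.0, True, False, {}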

Code reproducing the value-target computation for a run of two episodes of this environment under PPO is below:

import numpy as np

# Import paths as of recent RLlib versions (new API stack); they may differ
# slightly between releases.
from ray.rllib.utils.postprocessing.value_predictions import compute_value_targets
from ray.rllib.utils.postprocessing.zero_padding import unpad_data_if_necessary

episode_lens = [3, 3]
vfps = [0.0, 0.95, 0.95, 0.0, 0.95, 0.95]
rewards = [ 0.0, 1.0,  0.0,  0.0,  1.0,  0.0]
terminateds = [False,  True,  True, False,  True,  True]
truncateds = [False, False, False, False, False, False]
gamma = 0.99
lambda_ = 0.0

compute_value_targets(
    values=vfps,
    rewards=unpad_data_if_necessary(
        episode_lens,
        np.array(rewards),
    ),
    terminateds=unpad_data_if_necessary(
        episode_lens,
        np.array(terminateds),
    ),
    truncateds=unpad_data_if_necessary(
        episode_lens,
        np.array(truncateds),
    ),
    gamma=gamma,
    lambda_=lambda_,
)

(The extra timestep at the end of each episode is added by the AddOneTsToEpisodesAndTruncate connector.)
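
For clarity, the six array entries above map onto the two episodes like this:

# index:       0      1      2        3      4      5
# episode:     |---- episode 1 ----|  |---- episode 2 ----|
# state:       A      B      (extra)  A      B      (extra)
# vf pred:     0.0    0.95   0.95     0.0    0.95   0.95
# reward:      0.0    1.0    0.0      0.0    1.0    0.0
# terminated:  False  True   True     False  True   True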

Code output:

array([0., 1., 0., 0., 1., 0.], dtype=float32)

Shouldn’t the value targets at positions 0 and 3 be equal to gamma * 0.95 instead of zero? Setting lambda_ = 0 should cause the value targets to be computed purely from the next state’s value prediction, if I understand correctly.
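
In other words, with lambda_ = 0 I would expect the target at every step that isn’t a true terminal to be reward + gamma * V(next state), which for the data above works out to (hand calculation, not code output):

gamma = 0.99

# Position 0 (state A, episode 1): bootstrap from V(B) = 0.95.
expected_target_A = 0.0 + gamma * 0.95   # = 0.9405, but the code returns 0.0

# Position 1 (state B): the episode truly terminates here, so just the reward.
expected_target_B = 1.0                  # matches the code's output

# Expected targets over both episodes: [0.9405, 1.0, 0.0, 0.9405, 1.0, 0.0]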

Looking deeper into the code, should AddOneTsToEpisodesAndTruncate remove the terminated flag from the timesteps before the new one that it adds? For example, suppose the following environment:

A -(+0.0)-> B -(+0.0)-> C -(+1.0)-> TERMINATE
A -(+0.0)-> TERMINATE

With the current implementation, the value function (trained with lambda set to zero) would converge to:

V(A) = 0.0
V(B) = 0.0
V(C) = 1.0

I would expect value bootstrapping, run correctly under a random policy, to converge to:

V(A) = gamma**2 / 2
V(B) = gamma
V(C) = 1.0
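
For reference, these numbers come from a plain Bellman backup under a uniform random policy over the two actions available in A (assuming the terminate branch only exists at A):

gamma = 0.99

V_C = 1.0                            # C -> TERMINATE always pays +1.0
V_B = 0.0 + gamma * V_C              # B -> C deterministically, reward 0.0
V_A = 0.5 * gamma * V_B + 0.5 * 0.0  # A: half go to B, half terminate immediately

print(V_A, V_B, V_C)                 # 0.49005, 0.99, 1.0 == gamma**2 / 2, gamma, 1.0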

This discrepancy seems to mean that a PPO agent trained on this environment would fail to converge, because the advantage of reaching state B instead of terminating would be zero.

Am I misunderstanding something about the implementation, or is this a bug?

Your understanding is correct: with λ=0, value targets should be set to the next state’s value prediction (bootstrapping), i.e., target = reward + γ * V(next_state), except at true terminals, where the target should be just the reward.

However, the output you observe ([0., 1., 0., 0., 1., 0.]) is due to how the terminateds flag is handled: the extra timestep added by AddOneTsToEpisodesAndTruncate does not alter the terminateds flag of the preceding timestep, so the timestep before the artificial one is still marked as terminated. This causes the value target at that position to be set to zero rather than γ * V(next_state), which is not the expected behavior for λ=0 bootstrapping.
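
To make this concrete, here is a simplified, lambda_-zero-only sketch (not the actual RLlib code) of how the zero targets arise, assuming that value predictions at steps flagged terminated are zeroed out before bootstrapping:

import numpy as np

values = np.array([0.0, 0.95, 0.95, 0.0, 0.95, 0.95])
rewards = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
terminateds = np.array([False, True, True, False, True, True])
gamma = 0.99

# Value predictions at steps flagged terminated are treated as 0.0.
bootstrap_values = np.append(values * (1.0 - terminateds), 0.0)

# With lambda_ = 0, each target is just reward + gamma * (zeroed) next value.
targets = rewards + gamma * bootstrap_values[1:]
print(targets)  # [0. 1. 0. 0. 1. 0.] -- matches the reported output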

This is a known subtlety: the connector does not remove the terminated flag from the penultimate timestep, so the value target calculation treats it as a terminal, not as a state to be bootstrapped from the artificial timestep. As a result, the value targets at those positions are zero, not γ * V(next_state). This can indeed prevent PPO from learning the correct value function in such short, deterministic environments. For more, see the implementation and comments in add_one_ts_to_episodes_and_truncate and compute_value_targets. Would you like a step-by-step breakdown of the value target computation for your example?

The LLM just described how the code is currently implemented (with a few issues; I was not asking about bootstrapping from the artificial last timestep, but from the true one), which doesn’t really answer my question, but the phrase “known subtlety” is interesting. Is this the intended behavior? If so, what is the recommended way to solve environments with terminal rewards, assuming I want to use pure value bootstrapping rather than reward discounting to compute my value targets? Or was that phrasing just a hallucination?

A more detailed, self-contained example can be found here, complete with outputs demonstrating the value bootstrapping issue. It seems to break PPO for any environment with a terminal reward when lambda_ is set to zero.

As best I can tell, the issue can be fixed by altering the following lines in AddOneTsToEpisodesAndTruncate:

terminateds = (
    [False for _ in range(len_ - 1)]
    # + [bool(sa_episode.is_terminated)]
    + [False]  # FIX: keeping True here zeroes this step's value in the target
               # computation, so earlier states never bootstrap the terminal reward.
    + [True]   # extra timestep
)
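
Using the same simplified lambda_=0 calculation as in the sketch above, un-flagging the true last step gives the targets I would expect (again just an illustration, not actual RLlib output):

import numpy as np

values = np.array([0.0, 0.95, 0.95, 0.0, 0.95, 0.95])
rewards = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
gamma = 0.99

# With the fix, only the artificial extra timestep keeps the terminated flag.
patched_terminateds = np.array([False, False, True, False, False, True])

bootstrap_values = np.append(values * (1.0 - patched_terminateds), 0.0)
targets = rewards + gamma * bootstrap_values[1:]
print(targets)  # [0.9405 1. 0. 0.9405 1. 0.] -- positions 0 and 3 now bootstrap from V(B)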

Can a team member tell me whether this fix has other consequences? I’ve thought about it a bit, and it doesn’t seem to have any, as long as the terminated flag is only used in the value target calculation.