Handling delayed rewards in PPO with GAE – Alternatives and Trajectory View API in Ray RLlib

Hi everyone,

I’m working with PPO and GAE (λ = 0.99) in an environment where rewards are significantly delayed: the effect of my actions manifests only after many steps and results from a sequence of intermediate decisions.

The main challenge is that the time window within which my actions influence the outcome is not fixed, so I cannot know in advance how long it will take for an action to affect the final reward.

I am currently using Ray RLlib and came across the Trajectory View API, which allows access to specific segments of the trajectory during training. However, I am unsure whether and how this API could be helpful in my case, or how to configure it to handle rewards that emerge only after a long and variable number of steps.

Would the Trajectory View API be suitable for this scenario? If not, what alternative approaches could help stabilize learning in such an environment?

Any insights or experiences would be greatly appreciated!

Thanks in advance!

L.

Hi @LeoLeoLeo,

thanks for the great question! What you’re describing is a common challenge in reinforcement learning, known as the credit assignment problem. RL algorithms address this in various ways. In PPO, for instance, the value function, particularly when used with Generalized Advantage Estimation (GAE), helps tackle long-horizon, sparse-reward scenarios. By learning to predict the expected return from a given state when following the current policy, the algorithm can effectively account for delayed rewards over time.
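
To illustrate, here is a minimal sketch of how the relevant GAE knobs are set on a PPO config in RLlib. The environment name and the specific values for `gamma` and `lambda_` are placeholders you would tune for your task; note that the GAE parameter is passed as `lambda_` in the `training()` call:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Minimal sketch: tune the discount factor and GAE lambda so that value
# bootstrapping can carry credit back over long horizons.
# "MyDelayedRewardEnv-v0" is a placeholder for your registered environment.
config = (
    PPOConfig()
    .environment("MyDelayedRewardEnv-v0")
    .training(
        gamma=0.99,    # discount factor: how far future rewards propagate back
        lambda_=0.95,  # GAE lambda: bias/variance trade-off of the advantage estimate
    )
)
algo = config.build()
```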

The Trajectory View API is part of RLlib’s old API stack. In the new stack (which we suggest using, because the old stack will be deprecated soon), the Episode API together with the ConnectorV2 API gives you direct access to complete trajectories.
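
As a rough illustration (not an official recipe), a custom learner connector in the new stack could look like the sketch below. The class name and the reward post-processing are hypothetical, and the exact call signature may differ slightly between RLlib versions, so check the ConnectorV2 docs and examples for your version:

```python
from ray.rllib.connectors.connector_v2 import ConnectorV2


class DelayedRewardConnector(ConnectorV2):
    """Hypothetical learner connector that receives complete episodes.

    Sketch only: the signature below follows the new-stack ConnectorV2
    pattern, but verify the exact arguments against your RLlib version.
    """

    def __call__(self, *, rl_module, batch, episodes, explore=None,
                 shared_data=None, **kwargs):
        for episode in episodes:
            # Each episode object exposes the full trajectory collected so far
            # (e.g. all rewards), so you can reason over arbitrarily long delays.
            rewards = episode.get_rewards()
            # ... custom credit-assignment / reward-shaping logic would go here ...
        return batch
```

You would then plug such a connector into the Learner connector pipeline of your AlgorithmConfig; the ConnectorV2 examples in the RLlib repo show the exact wiring.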

If you are suffering from sparse rewards and need more exploration, you could take a look at Curiosity-driven Exploration.
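
For reference, this is roughly how the ICM-based Curiosity exploration is enabled on the old API stack; the environment name and the hyperparameter values are illustrative only, and on the new stack the equivalent is implemented via the connector/Learner-based curiosity examples in the RLlib repo:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch of the old-stack ICM-based curiosity setup (illustrative values).
# "MyDelayedRewardEnv-v0" is a placeholder environment name.
config = (
    PPOConfig()
    .environment("MyDelayedRewardEnv-v0")
    .framework("torch")  # the Curiosity exploration module is torch-only
    .exploration(
        explore=True,
        exploration_config={
            "type": "Curiosity",   # intrinsic curiosity module (ICM)
            "eta": 1.0,            # weight of the intrinsic reward
            "beta": 0.2,           # trade-off between forward and inverse loss
            "feature_dim": 288,    # dimensionality of the learned feature space
            "sub_exploration": {"type": "StochasticSampling"},
        },
    )
)
```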