Hi everyone,
I’m working with PPO and GAE (λ = 0.99) in an environment where rewards are significantly delayed: the effect of my actions only shows up many steps later, as the result of a sequence of intermediate decisions.
The main challenge is that the time window within which my actions influence the outcome is not fixed, meaning I cannot know in advance how long it will take for an action to impact the final reward.
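For context, my training setup looks roughly like this (a minimal sketch; the env is a stand-in for my actual environment and the hyperparameters are illustrative, not tuned values):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # "CartPole-v1" is just a placeholder for my real delayed-reward env.
    .environment("CartPole-v1")
    .training(
        gamma=0.999,   # assumption: keeping the discount close to 1 so far-away rewards still carry weight
        lambda_=0.99,  # the GAE lambda mentioned above
    )
)
algo = config.build()
```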
I am currently using Ray RLlib and came across the Trajectory View API, which allows access to specific segments of the trajectory during training. However, I am unsure whether and how this API could be helpful in my case, or how to configure it to handle rewards that emerge only after a long and variable number of steps.
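The only concrete usage pattern I’ve found so far is the frame-stacking style example from the docs, where a custom model on the old ModelV2 stack registers an extra view column so that RLlib hands it the last N observations. Roughly like this (a minimal sketch; `num_frames`, the class name, and the tiny network are all just illustrative):

```python
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.view_requirement import ViewRequirement


class StackedObsModel(TorchModelV2, nn.Module):
    """Toy model that asks RLlib for the last `num_frames` observations."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name,
                 num_frames=16):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        obs_dim = int(obs_space.shape[0])
        self.net = nn.Sequential(
            nn.Linear(obs_dim * num_frames, 128),
            nn.ReLU(),
            nn.Linear(128, num_outputs),
        )
        self.value_head = nn.Linear(obs_dim * num_frames, 1)
        self._last_flat = None
        # Extra view column: the last `num_frames` observations, stacked.
        self.view_requirements["prev_n_obs"] = ViewRequirement(
            data_col="obs",
            shift="-{}:0".format(num_frames - 1),
            space=obs_space,
        )

    def forward(self, input_dict, state, seq_lens):
        # [B, num_frames, obs_dim] -> [B, num_frames * obs_dim]
        stacked = input_dict["prev_n_obs"].float()
        self._last_flat = stacked.reshape(stacked.shape[0], -1)
        return self.net(self._last_flat), state

    def value_function(self):
        return self.value_head(self._last_flat).squeeze(1)
```

But since this only gives the model a fixed-length window of past observations, I don’t see how it would cover a delay whose length varies from episode to episode.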
Would the Trajectory View API be suitable for this scenario? If not, what alternative approaches could help stabilize learning in such an environment?
Any insights or experiences would be greatly appreciated!
Thanks in advance!
L.