Adapted GAE formula for the PPO algorithm applied to problems modeled as a Semi-Markov Decision Process (SMDP)

Hello community,

Does anyone have experience with problems modeled as a Semi-Markov Decision Process (SMDP) and has tried to use the PPO algorithm to train a policy?
When a problem is modeled as an SMDP, the transition steps do not have a fixed length; in other words, the actions between steps take a variable amount of time to execute. As far as I know, one should account for this, because otherwise learning could get stuck or fail. I guess one way to account for this difference compared to standard MDPs could be time-based discounting of rewards, but I'm not sure what consequences this would have for the PPO algorithm. I know PPO uses generalized advantage estimation (GAE) with the two discount factors gamma and lambda, and additionally has a horizon parameter H for these advantage estimates. Are any adaptations necessary here?
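To make the time-based discounting idea a bit more concrete, here is a rough sketch of how such an adapted GAE could look. This is only my own assumption (per-step discount `gamma ** tau_t`, where `tau_t` is the duration of step `t`, with the lambda decay scaled the same way), written in plain NumPy; it is not based on RLlib's actual GAE implementation:

```python
import numpy as np

def gae_smdp(rewards, values, durations, gamma=0.99, lam=0.95):
    """Hypothetical GAE variant where each step t has a duration tau_t.

    rewards:   r_t collected over the (variable-length) step t
    values:    V(s_0), ..., V(s_T)  (length T + 1, bootstrap value last)
    durations: tau_t, the time the action at step t took to execute
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # time-scaled discount: gamma ** tau_t instead of a fixed gamma
        discount = gamma ** durations[t]
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # scale the lambda decay by the duration as well (one possible choice)
        gae = delta + discount * (lam ** durations[t]) * gae
        advantages[t] = gae
    return advantages
```

With all durations equal to 1 this reduces to the standard GAE, which would be a useful sanity check in any case.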

PS: Or is there any other algorithm specifically designed for SMDPs?

In the meantime, I have written a short document containing a proposal for an adapted GAE formula that uses trajectories with variable-length actions for advantage estimation.
What do you think about my proposal? Do you think it is reasonable?

If so, I guess it's not too much work to implement this adapted GAE formula in a similar way to how the standard GAE for PPO is implemented in RLlib.
Feedback is really appreciated! :v: