Hello community,
Does anyone have experience with problems modeled as a Semi-Markov Decision Process (SMDP) and has tried to use the PPO algorithm to train a policy?
When a problem is modeled as an SMDP, the transition steps have variable length; in other words, actions take different amounts of time to execute. As far as I know, this has to be accounted for, otherwise learning can stall or fail. I guess one way to handle the difference compared to a standard MDP is time-based discounting of the rewards, but I'm not sure what consequences this has for the PPO algorithm. I know PPO uses generalized advantage estimation (GAE) with the two discount factors gamma and lambda, plus a horizon parameter H for the advantage estimates. Are any adaptations necessary here?
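To make the question more concrete, here is roughly what I have in mind (a minimal, untested sketch; `durations` is assumed to hold the sojourn time tau of each transition, and whether lambda should be scaled the same way is exactly what I'm unsure about):

```python
import numpy as np

def gae_with_variable_durations(rewards, values, durations, gamma=0.99, lam=0.95):
    """GAE where the per-step discount is gamma ** tau_t instead of a fixed gamma.

    rewards:   r_t collected on each transition (length T)
    values:    V(s_t) for t = 0..T (one extra bootstrap value at the end)
    durations: tau_t, the (possibly non-integer) time each action took (length T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # discount depends on how long the action actually took
        discount = gamma ** durations[t]
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * lam * gae
        advantages[t] = gae
    return advantages
```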
PS: Or is there any other algorithm specifically designed for SMDPs?