Hello community,
Does anyone have experience with problems modeled as a Semi-Markov Decision Process (SMDP) and has tried to use the PPO algorithm to train a policy?
When a problem is modeled as an SMDP, the transition steps have variable length; in other words, actions take different amounts of time to execute. As far as I know, this has to be accounted for, otherwise learning can stall or fail. I guess one way to handle the difference compared to a standard MDP is time-based discounting of the rewards, but I'm not sure what consequences this has for the PPO algorithm. I know PPO uses generalized advantage estimation (GAE) with the two discount factors gamma and lambda, plus a horizon parameter H for the advantage estimates. Are any adaptations necessary here?
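To make the question more concrete, here is roughly what I have in mind (a minimal, untested sketch; `durations` is assumed to hold the sojourn time tau of each transition, and whether lambda should be scaled the same way is exactly what I'm unsure about):

```python
import numpy as np

def gae_with_variable_durations(rewards, values, durations, gamma=0.99, lam=0.95):
    """GAE where the per-step discount is gamma ** tau_t instead of a fixed gamma.

    rewards:   r_t collected on each transition (length T)
    values:    V(s_t) for t = 0..T (one extra bootstrap value at the end)
    durations: tau_t, the (possibly non-integer) time each action took (length T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # discount depends on how long the action actually took
        discount = gamma ** durations[t]
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * lam * gae
        advantages[t] = gae
    return advantages
```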
PS: Or is there any other algorithm specifically designed for SMDPs?