Semi-MDPs and RLlib: Problems where the time to execute an action varies strongly

Hello RLlib community,

A researcher I work with made me aware of the following:
If the duration of executing an action varies strongly between actions, he suggested discounting for the length of executing the action (i.e. gamma^{time to execute action} instead of just gamma).
Correct me if I’m wrong, but such problems, where actions can take different amounts of time to execute, are known as Semi-MDPs (i.e. the length of a timestep is irregular). As far as I know, RLlib internally treats every step as “one single step” regardless of its actual length/duration and doesn’t account for the described problem.

So my question is, can RLlib also account for this problem (Semi-MDPs)? Or do you think one can ignore this fact and treat Semi-MDPs just as standard MDPs? Any experiences or suggestions?

PS: In the case of PPO, I suppose changes might be required in compute_advantages in postprocessing.py, and information about the time to execute an action would also have to be carried along ((s, a, r, s’) → (s, a, d, r, s’), where d = duration of executing the action).
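To sketch what I mean in plain Python (purely illustrative, not RLlib code): with the durations d carried along next to the rewards, a duration-aware return computation would roughly look like this.

```python
import numpy as np


def duration_aware_returns(rewards, durations, gamma=0.99):
    """Illustration only: compute returns with a per-step discount of
    gamma**d_i (d_i = time action i took) instead of a constant gamma."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma ** durations[i] * running
        returns[i] = running
    return returns


# With all durations equal to 1, this reduces to standard gamma-discounting.
print(duration_aware_returns([1.0, 1.0, 1.0], [1.0, 3.0, 1.0]))
```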


Hey @klausk55 , great questions and thanks for describing this interesting problem.
Yes, that’s exactly correct: RLlib pretends that we have discrete time steps without any difference in the time-deltas between actions. You could “fix” the rewards afterwards in postprocess_trajectory (by overriding on_postprocess_trajectory in your custom callbacks object; see ray.rllib.examples.custom_metrics_and_callbacks.py) and change the “rewards” key of the incoming batch there. For that, though, you’d need the time-deltas somewhere, e.g. in the “infos” dicts coming from the env, or as part of the observations.


Hey @sven1977, if I get you right, you would simply suggest discounting the rewards by their corresponding time-deltas (i.e. r_i * gamma^{d_i}, where d_i is the time-delta for reward r_i) in the postprocessing of a trajectory.
If so, this should be equivalent to immediately discounting the rewards by their corresponding time-deltas before returning them from the env, shouldn’t it? Of course, I have to know the time-deltas, otherwise I won’t be able to do this.
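For example via a wrapper around my env like this (sketch only; the “duration” info key and the hard-coded gamma are placeholders that would have to match my env and trainer config):

```python
import gym


class DurationDiscountWrapper(gym.Wrapper):
    """Sketch: apply the gamma^{d} factor to each reward inside the env."""

    def __init__(self, env, gamma=0.99):
        super().__init__(env)
        self.gamma = gamma  # should match the trainer's gamma

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # "duration" is a placeholder key the wrapped env would have to set.
        d = info.get("duration", 1.0)
        return obs, reward * self.gamma ** d, done, info
```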

If there is no mistake in my line of thinking, would this approach be enough to tackle Semi-MDPs?