Semi-MDPs and RLlib: Problems where the time to execute an action varies strongly

Hello RLlib community,

A researcher I work with made me aware of the following:
If the duration of executing an action varies strongly between actions, he suggested discounting for the length of executing the action (i.e. gamma^{time to execute action} instead of just gamma).
Correct me if I’m wrong, but such problems, where actions can take different amounts of time to execute, are known as Semi-MDPs (i.e. the length of a timestep is irregular). As far as I know, RLlib internally treats every step as “one single step” regardless of its actual length/duration and doesn’t account for the described problem.

So my question is, can RLlib also account for this problem (Semi-MDPs)? Or do you think one can ignore this fact and treat Semi-MDPs just as standard MDPs? Any experiences or suggestions?

PS: In the case of PPO, I suppose changes might be required in compute_advantages in postprocessing.py, and information about the time to execute an action would also have to be carried along ((s, a, r, s’) → (s, a, d, r, s’), where d = duration of executing the action).
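To sketch what I mean in plain Python (purely illustrative, not RLlib code): with the durations d carried along next to the rewards, a duration-aware return computation would roughly look like this.

```python
import numpy as np


def duration_aware_returns(rewards, durations, gamma=0.99):
    """Illustration only: compute returns with a per-step discount of
    gamma**d_i (d_i = time action i took) instead of a constant gamma."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma ** durations[i] * running
        returns[i] = running
    return returns


# With all durations equal to 1, this reduces to standard gamma-discounting.
print(duration_aware_returns([1.0, 1.0, 1.0], [1.0, 3.0, 1.0]))
```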


Hey @klausk55 , great questions and thanks for describing this interesting problem.
Yes, that’s exactly correct: RLlib pretends that we have discrete time steps without any difference in the time-deltas between actions. You could “fix” the rewards afterwards in postprocess_trajectory (by overriding on_postprocess_trajectory in your custom callbacks object; see ray.rllib.examples.custom_metrics_and_callbacks.py) and change the “rewards” key of the incoming batch there. For that, though, you’d need the time-deltas somewhere, e.g. in the “infos” dicts coming from the env, or as part of the observations.


Hey @sven1977, if I get you right, you would simply suggest discounting the rewards by their corresponding time-deltas (i.e. r_i * gamma^{d_i}, where d_i is the time-delta for reward r_i) in the postprocessing of a trajectory.
If so, this should be equivalent to immediately discounting the rewards by their corresponding time-deltas before returning them from the env, shouldn’t it? Of course, I have to know the time-deltas, otherwise I won’t be able to do this.
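For example via a wrapper around my env like this (sketch only; the “duration” info key and the hard-coded gamma are placeholders that would have to match my env and trainer config):

```python
import gym


class DurationDiscountWrapper(gym.Wrapper):
    """Sketch: apply the gamma^{d} factor to each reward inside the env."""

    def __init__(self, env, gamma=0.99):
        super().__init__(env)
        self.gamma = gamma  # should match the trainer's gamma

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # "duration" is a placeholder key the wrapped env would have to set.
        d = info.get("duration", 1.0)
        return obs, reward * self.gamma ** d, done, info
```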

If there is no mistake in my line of thinking, would this approach be enough to tackle Semi-MDPs?