Intermediate rewards and adjusted gamma for DQN/APEX (Semi-Markov Decision Process)

Hi!
I am using Ray RLlib in combination with a traffic simulator to train an agent that controls traffic lights.
Let’s say the minimum green time of a traffic-light phase is 5 s and the transition time (green -> yellow -> red) is 3 s. That means my agent interacts with the traffic lights non-periodically (sometimes after 5 s and sometimes after 3 s + 5 s = 8 s), which differs from the standard Markov Decision Process framework.

To tackle this, I have to use the Semi-Markov Decision Process framework. For Q-learning, this modifies the TD error as follows:
r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N} + gamma^{N+1} * maxQ(s',a') - Q(s,a)
where r_t, r_{t+1}, ... are equally spaced intermediate rewards, calculated once per second.

The calculation of r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N} is done in my custom environment.
But depending on the time between two interaction points of my agent with the traffic lights (5 s or 8 s), the algorithm has to discount maxQ(s',a') with a different N in gamma^{N+1}. How can I tell the algorithm (DQN or APEX) which discount factor to use?
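
For concreteness, here is a tiny, purely illustrative Python sketch (names made up by me) of what my env already computes and which exponent the bootstrap term would need for the two interval lengths:

# Toy illustration only: the env folds the 1-second intermediate rewards into
# one discounted return; the bootstrap term then needs gamma^(N+1), where
# N+1 equals the number of seconds between two interaction points.
def smdp_return(intermediate_rewards, gamma):
    # r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N}
    return sum(gamma ** k * r for k, r in enumerate(intermediate_rewards))

gamma = 0.99
for seconds in (5, 8):                     # 5 s (min green) or 8 s (transition + min green)
    rewards = [1.0] * seconds              # placeholder 1-second rewards
    bootstrap_discount = gamma ** seconds  # gamma^(N+1), differs per interval
    print(seconds, smdp_return(rewards, gamma), bootstrap_discount)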

Thanks guys in advance for any suggestions!


Could you try to publish the time delta from the last interaction to the current one in your env observations (Dict[actual_obs: Box(…), time_delta: Box((1,), float32)])?
And then use that information during postprocessing (similar to how we do n-step in DQN already)?
Just an idea. Let me know if this wouldn't work.
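
Something like this (rough sketch, not tested; obs_dim is just a placeholder for your actual traffic-light observation size):

import numpy as np
from gym.spaces import Box, Dict

obs_dim = 16  # placeholder: size of your actual traffic-light observation

observation_space = Dict({
    "actual_obs": Box(low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32),
    "time_delta": Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),
})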

Thanks Sven for your suggestion!
I am not quite sure I understand the postprocessing part. Do you mean I should use the time_delta information in the callback function “on_postprocess_trajectory” and somehow manipulate Q(s’,a’) there? Thanks for clarifying.

Yeah, I think that would work. Here is a step-by-step guide. But let me know if I have a mistake somewhere:

  1. Make your env return an info dict from the step() method (the 4th return value), in which you store the number of seconds passed since the last interaction.
  2. Use a CustomCallbacks sub-class and override its on_postprocess_trajectory to calculate the gamma to use for each step, then store these values back in the postprocessed_batch arg (you can create a new key in that SampleBatch; see the sketch below). Specify that sub-class under your Trainer’s “callbacks” config key.
  3. In your loss function, you should now be able to see the “gamma” column, by which you can multiply your Q values.

Would that work or am I missing something?
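
A rough sketch of steps 1 and 2 (the keys “time_delta” and “gammas” are just names I made up here, and the exact callback signature / SampleBatch layout may differ slightly between RLlib versions):

import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks

GAMMA = 0.99  # should match the "gamma" in your trainer config


class SemiMarkovCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id,
                                  policy_id, policies, postprocessed_batch,
                                  original_batches, **kwargs):
        # Step 1: the env wrote the seconds since the last interaction into
        # the info dict, e.g. info = {"time_delta": 5} or {"time_delta": 8}.
        time_deltas = np.array(
            [info.get("time_delta", 1.0) if isinstance(info, dict) else 1.0
             for info in postprocessed_batch["infos"]],
            dtype=np.float32)
        # Step 2: per-step discount for the bootstrap term, gamma^(N+1),
        # where N+1 = number of seconds between the two interaction points.
        postprocessed_batch["gammas"] = GAMMA ** time_deltas

Then point your Trainer to it via config["callbacks"] = SemiMarkovCallbacks.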

This approach makes sense to me, thanks.

I was able to implement the first two points. But regarding the loss function, I am not quite sure what the best way to modify it is, say for the APEX algorithm. From my understanding, I have to do the following:

from ray.rllib.agents.dqn import ApexTrainer
from ray.rllib.agents.dqn.dqn_torch_policy import DQNTorchPolicy

SemiMarkovPolicy = DQNTorchPolicy.with_updates(
    name="SemiMarkovDQNTorchPolicy", loss_fn=semi_markov_loss_fn)
SemiMarkovApexTrainer = ApexTrainer.with_updates(default_policy=SemiMarkovPolicy)

And then I would copy build_q_losses from rllib.agents.dqn.dqn_torch_policy.py, rename it to semi_markov_loss_fn, and also copy the QLoss class from the same file. Then I would make my changes in semi_markov_loss_fn and QLoss to adjust the gamma values. By copying the existing code and making only minor changes, I try to keep the basic structure intact, so that all “rainbow” extensions stay compatible with my code. Does this make sense, or is there a better way? Am I missing something?
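
As a sketch of the core change I have in mind (only the non-distributional branch; “gammas” is the per-step column written by the callback, and everything else would be copied verbatim from dqn_torch_policy.py):

def semi_markov_td_target(rewards, dones, q_tp1_best, per_step_gammas):
    # All args are torch tensors of shape [batch]. rewards already hold the
    # discounted sum of intermediate rewards computed in the env;
    # per_step_gammas holds gamma^(N+1) per transition (the "gammas" column),
    # replacing the fixed config["gamma"] ** n_step used in the original QLoss.
    q_tp1_best_masked = (1.0 - dones) * q_tp1_best
    return rewards + per_step_gammas * q_tp1_best_masked

In semi_markov_loss_fn I would then pass train_batch["gammas"] into the copied QLoss instead of the scalar discount.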

Yes, this looks totally fine. Make sure that your Trainer really uses the SemiMarkovPolicy, though. In older versions, the specified default_policy would get overwritten by the return value of the get_policy_class function (specified in ApexTrainer or DQNTrainer).
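
One way to pin it explicitly (assuming your RLlib version’s with_updates accepts a get_policy_class override, like build_trainer does):

SemiMarkovApexTrainer = ApexTrainer.with_updates(
    name="SemiMarkovApexTrainer",
    default_policy=SemiMarkovPolicy,
    # Override this too, so the default DQN policy-class selection cannot
    # silently replace SemiMarkovPolicy.
    get_policy_class=lambda config: SemiMarkovPolicy,
)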

@ElektroChan89 and @sven1977
I have written a short document containing a proposal for an adapted GAE formula which uses trajectories with actions of variable length for advantage estimation.
What do you think about my proposal? Do you think it is reasonable and might be used for a more general PPO?