# Intermediate rewards and adjusted gamma for DQN/APEX (Semi-Markov Decision Process)

Hi!
I am using Ray RLlib in combination with a traffic simulator in order to train an agent that controls traffic lights.
Let’s say the min green time of a traffic light phase is 5 s and the transition time (green -> yellow -> red) is 3 s. That means my agent is interacting with the traffic lights non-periodically (sometimes every 5 s and sometimes every 3 s + 5 s = 8 s), which is different from the “normal” Markov Decision Process framework.

In order to tackle this, I have to use the Semi-Markov Decision Process framework. For Q-learning this modifies the TD error as follows:
`r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N} + gamma^{N+1} * maxQ(s',a') - Q(s,a)`
where `r_t, r_{t+1}, ...` are equally spaced intermediate rewards, which are calculated every 1 second.

The calculation of `r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N}` is done in my custom environment.
But depending on the distance between two interaction points of my agent with the traffic lights (5 s or 8 s), the algorithm has to discount `maxQ(s',a')` with a different `N` in `gamma^{N+1}`. How can I tell the algorithm (DQN or APEX) which discount factor to use?
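
Just to make the discounting concrete, a small illustration with made-up numbers (gamma and the reward values are arbitrary here):

```python
# Made-up numbers illustrating the SMDP target for one 5-second phase
# with 1-second intermediate rewards.
gamma = 0.99
intermediate_rewards = [0.2, 0.1, 0.3, 0.0, 0.4]  # r_t, r_{t+1}, ..., r_{t+N} (N = 4)

# r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N}  (this part is computed in my env)
reward_part = sum(gamma**k * r for k, r in enumerate(intermediate_rewards))

# gamma^(N+1): the factor the algorithm would have to apply to maxQ(s', a')
bootstrap_discount = gamma ** len(intermediate_rewards)
```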

Thanks guys in advance for any suggestions!


Could you try to publish the time delta from the last interaction to the current one in your env observations (e.g. `Dict({"actual_obs": Box(...), "time_delta": Box((1,), float32)})`)?
And then use that information during postprocessing (similar to how we already do n-step in DQN)?
Just an idea. Let me know if this wouldn’t work.
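
For example, the observation space could look roughly like this (a sketch only; the `actual_obs` shape and bounds are just placeholders for your real observation):

```python
import numpy as np
from gym.spaces import Box, Dict

# Observation space that also exposes the seconds since the last interaction.
observation_space = Dict({
    "actual_obs": Box(low=-np.inf, high=np.inf, shape=(10,), dtype=np.float32),
    "time_delta": Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),
})
```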

I am not quite sure I understand the postprocessing part. Do you mean that I should use the time_delta information in the callback function `on_postprocess_trajectory` and somehow manipulate Q(s’,a’) there? Thanks for clarifying this.

Yeah, I think that would work. Here is a step-by-step guide, but let me know if I have a mistake somewhere:

1. Make your env return an info dict from the step() method (the 4th return value), in which you store the number of seconds passed since the last interaction.
2. Use a custom callbacks sub-class (subclassing RLlib’s `DefaultCallbacks`) and override its `on_postprocess_trajectory` to calculate the gammas to use for each step, then store them back in the `postprocessed_batch` arg (you can create a new key in that SampleBatch). Specify that sub-class under your Trainer’s “callbacks” config key (see the sketch after this list).
3. In your loss function, you should now be able to see the “gamma” column, which you can use to multiply your Q values with.

Would that work or am I missing something?
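
As a rough sketch of steps 1 and 2 (the info key `"seconds_since_last_action"` and the `"step_gamma"` column are just names I made up here, not RLlib built-ins):

```python
import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class SemiMarkovCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id,
                                  policy_id, policies, postprocessed_batch,
                                  original_batches, **kwargs):
        gamma = policies[policy_id].config["gamma"]
        # Seconds elapsed per transition, as stored by the env in its info dict.
        deltas = np.array([
            info.get("seconds_since_last_action", 1.0)
            if isinstance(info, dict) else 1.0
            for info in postprocessed_batch[SampleBatch.INFOS]
        ], dtype=np.float32)
        # Per-step discount gamma^(N+1) to apply to maxQ(s', a') in the loss.
        postprocessed_batch["step_gamma"] = gamma ** deltas


# Then in the Trainer config: {"callbacks": SemiMarkovCallbacks, ...}
```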

This approach makes sense to me, thanks.

I am able to implement the first 2 points. But regarding the loss function, I am not quite sure what the best way is to change/modify it, let’s say for the Apex algorithm. From my understanding I have to do the following:

```python
from ray.rllib.agents.dqn import ApexTrainer
from ray.rllib.agents.dqn.dqn_torch_policy import DQNTorchPolicy
```

And then I would copy `build_q_losses` from `rllib/agents/dqn/dqn_torch_policy.py`, rename it to `semi_markov_loss_fn`, and also copy `class QLoss` from the same file. Then I would make the changes in `semi_markov_loss_fn` and `QLoss` needed to adjust the gamma values. By copying the existing code and making only minor changes, I try to keep the basic structure intact, so that all “rainbow” extensions stay compatible with my code. Does this make sense, or is there a better way? Am I missing something?

Yes, this looks totally fine. Make sure that your Trainer really uses the SemiMarkovPolicy, though. In older versions, the specified `default_policy` would get overwritten by the return value of the `get_policy_class` function (specified in ApexTrainer or DQNTrainer).
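
In case it helps, a sketch of how the pieces could be wired together (`semi_markov_loss_fn` stands for your modified copy of `build_q_losses`; the exact `with_updates` kwargs can differ slightly between RLlib versions):

```python
from ray.rllib.agents.dqn import ApexTrainer
from ray.rllib.agents.dqn.dqn_torch_policy import DQNTorchPolicy
# Placeholder only: swap in your modified copy of build_q_losses here.
from ray.rllib.agents.dqn.dqn_torch_policy import build_q_losses as semi_markov_loss_fn

SemiMarkovDQNTorchPolicy = DQNTorchPolicy.with_updates(
    name="SemiMarkovDQNTorchPolicy",
    loss_fn=semi_markov_loss_fn,
)

# Override both default_policy and get_policy_class, so the custom policy
# is really the one the Trainer ends up using.
SemiMarkovApexTrainer = ApexTrainer.with_updates(
    name="SemiMarkovApexTrainer",
    default_policy=SemiMarkovDQNTorchPolicy,
    get_policy_class=lambda config: SemiMarkovDQNTorchPolicy,
)
```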