How is the n-step DQN implemented in rllib?

(What are done_mask and rewards, and where are they computed?)

My understanding is that the Q-update actually happens on line 92 of dqn_torch_policy.py (for the torch implementation). For the n-step case, is “rewards” on line 92 actually the n-step return, or a tensor of n rewards? Where and how is the n-step return computed? And is “q_tp1_best_masked” effectively the Q value for the n-th step? It is computed using done_mask, but what is done_mask and how/where is it computed?

Both rewards and done_mask are passed to QLoss, which is called on line 287 of dqn_torch_policy.py. rewards is assigned from train_batch[SampleBatch.REWARDS] and done_mask from train_batch[SampleBatch.DONES].float(). train_batch is passed as an argument to build_q_losses, defined on line 221. It looks to me like train_batch is a dictionary where SampleBatch.REWARDS and SampleBatch.DONES are the keys, but how are the values computed?

Hi @Terpragon,

Welcome to the forum.

Both of those values will contain a batch of data drawn from a replay buffer.

The values in DONES are whatever was returned by the environment for that step.
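done_mask is just that DONES tensor cast to float, and in the loss it zeroes out the bootstrap term on terminal transitions. Here is a minimal sketch of what q_tp1_best_masked and the target amount to in the simple (non-distributional) case; the function and variable names are illustrative, not the exact RLlib code:

```python
import torch

def n_step_q_target(rewards, done_mask, q_tp1_best, gamma, n_step):
    # Zero out the bootstrapped value wherever the sampled transition was terminal.
    q_tp1_best_masked = (1.0 - done_mask) * q_tp1_best
    # rewards already holds the n-step return (see the post-processing sketch below),
    # so the remaining discount on the bootstrap term is gamma ** n_step.
    return rewards + gamma ** n_step * q_tp1_best_masked

# Toy batch of three transitions; the last one ended the episode.
rewards = torch.tensor([2.71, 1.9, 1.0])     # n-step returns
done_mask = torch.tensor([0.0, 0.0, 1.0])    # DONES cast to float
q_tp1_best = torch.tensor([5.0, 5.0, 5.0])   # max_a Q_target(s_{t+n}, a)
target = n_step_q_target(rewards, done_mask, q_tp1_best, gamma=0.9, n_step=3)
# -> rewards + 0.9**3 * 5 for the first two rows, plain rewards for the terminal row
```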

The REWARDS are the n-step returns, updated in place during post-processing by this function:

DQN calls it from here:
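
In case it helps, here is a rough, self-contained sketch of what that in-place adjustment does (the function name and signature here are illustrative, not the exact RLlib API):

```python
def adjust_nstep_sketch(n_step, gamma, rewards, new_obs, dones):
    """Fold the next n_step - 1 rewards into rewards[i] (in place) and point
    new_obs[i] / dones[i] at the step the target network will bootstrap from."""
    traj_len = len(rewards)
    for i in range(traj_len):
        for j in range(1, n_step):
            if i + j < traj_len:
                rewards[i] += gamma ** j * rewards[i + j]
                new_obs[i] = new_obs[i + j]
                dones[i] = dones[i + j]

# Example: a 4-step trajectory with gamma = 0.9 and n_step = 3
rewards = [1.0, 1.0, 1.0, 1.0]
new_obs = ["s1", "s2", "s3", "s4"]
dones = [False, False, False, True]
adjust_nstep_sketch(3, 0.9, rewards, new_obs, dones)
# rewards[0] is now 1 + 0.9 + 0.81 = 2.71, new_obs[0] == "s3", dones[0] == False,
# while the later transitions keep the terminal done flag.
```

So by the time a batch reaches QLoss, SampleBatch.REWARDS already contains the discounted n-step return and SampleBatch.DONES refers to the transition being bootstrapped from, which is why the loss only needs the single gamma ** n_step factor shown in the sketch above.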


Thank you! Exactly what I needed.