(What are done_mask and rewards, and where are they computed?)
My understanding is that the Q-update actually happens in line 92 of dqn_torch_policy.py (for the torch implementation). For the n-step case, is “rewards” in line 92 actually the n-step return, or a tensor of n individual rewards? Where and how is the n-step return computed? And is “q_tp1_best_masked” effectively the Q value at the n-th step? It is computed using done_mask, but what is done_mask and how/where is it computed?
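To make the question concrete, here is my mental model of what an n-step return would look like if “rewards” is indeed a single folded-up scalar per step. This is only an illustrative sketch (the `nstep_return` helper, `GAMMA`, and `N_STEP` are my own names, not RLlib code), not what RLlib necessarily does:

```python
import numpy as np

GAMMA, N_STEP = 0.99, 3  # illustrative hyperparameters, not RLlib defaults

def nstep_return(rewards, dones, gamma=GAMMA, n=N_STEP):
    """Fold up to n future rewards into one discounted scalar per step,
    truncating at episode boundaries (dones). Purely illustrative."""
    out = np.zeros_like(rewards, dtype=np.float64)
    T = len(rewards)
    for t in range(T):
        g, discount = 0.0, 1.0
        for k in range(n):
            if t + k >= T:
                break
            g += discount * rewards[t + k]
            if dones[t + k]:
                break
            discount *= gamma
        out[t] = g
    return out

rewards = np.array([1.0, 1.0, 1.0, 1.0])
dones = np.array([False, False, False, True])
print(nstep_return(rewards, dones))  # → [2.9701 2.9701 1.99   1.    ]
```

If this is roughly right, then the Q-target at step t would bootstrap from the Q value n steps ahead, which would explain q_tp1_best_masked being the n-th-step Q value.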
Both rewards and done_mask are passed to QLoss, which is called in line 287 of dqn_torch_policy.py. rewards is assigned from train_batch[SampleBatch.REWARDS] and done_mask from train_batch[SampleBatch.DONES].float(). train_batch is passed as an argument to build_q_losses, defined in line 221. It looks to me like train_batch is a dictionary where SampleBatch.REWARDS and SampleBatch.DONES are the keys, but how are the values computed?
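For reference, this is how I currently read the role of done_mask in the loss: it zeroes the bootstrap term wherever the episode terminated. A small sketch under that assumption (numpy stand-ins for the torch tensors, and the target-zeroing logic is my reading, not verified against the actual QLoss code):

```python
import numpy as np

gamma, n_step = 0.99, 1  # illustrative values

# Stand-ins for the tensors I believe come out of train_batch:
rewards = np.array([1.0, 0.5])        # train_batch[SampleBatch.REWARDS]
done_mask = np.array([0.0, 1.0])      # train_batch[SampleBatch.DONES].float()
q_tp1_best = np.array([10.0, 10.0])   # max_a Q(s_{t+n}, a) from the target net

# My reading: mask out the bootstrap where the episode ended,
# then form the TD target as reward + discounted bootstrap.
q_tp1_best_masked = (1.0 - done_mask) * q_tp1_best
q_target = rewards + gamma ** n_step * q_tp1_best_masked
print(q_target)  # → [10.9  0.5]
```

If that reading is correct, my remaining question is only about where the values behind SampleBatch.REWARDS and SampleBatch.DONES get filled in before the batch reaches the loss.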