(What are done_mask and rewards, and where are they computed?)
My understanding is that the Q-update actually happens in line 92 of dqn_torch_policy.py (for the torch implementation). For the n-step case, is “rewards” in line 92 actually the n-step return, or a tensor of n individual rewards? Where and how is the n-step return computed? And is “q_tp1_best_masked” effectively the Q value at the n-th step? It is computed using done_mask, but what is done_mask and how/where is it computed?
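To make the question concrete, here is my mental model of what an n-step return would look like if “rewards” is indeed a single folded-up scalar per step. This is only an illustrative sketch (the `nstep_return` helper, `GAMMA`, and `N_STEP` are my own names, not RLlib code), not what RLlib necessarily does:

```python
import numpy as np

GAMMA, N_STEP = 0.99, 3  # illustrative hyperparameters, not RLlib defaults

def nstep_return(rewards, dones, gamma=GAMMA, n=N_STEP):
    """Fold up to n future rewards into one discounted scalar per step,
    truncating at episode boundaries (dones). Purely illustrative."""
    out = np.zeros_like(rewards, dtype=np.float64)
    T = len(rewards)
    for t in range(T):
        g, discount = 0.0, 1.0
        for k in range(n):
            if t + k >= T:
                break
            g += discount * rewards[t + k]
            if dones[t + k]:
                break
            discount *= gamma
        out[t] = g
    return out

rewards = np.array([1.0, 1.0, 1.0, 1.0])
dones = np.array([False, False, False, True])
print(nstep_return(rewards, dones))  # → [2.9701 2.9701 1.99   1.    ]
```

If this is roughly right, then the Q-target at step t would bootstrap from the Q value n steps ahead, which would explain q_tp1_best_masked being the n-th-step Q value.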
Both rewards and done_mask are passed to QLoss, which is called in line 287 of dqn_torch_policy.py. rewards is assigned from train_batch[SampleBatch.REWARDS] and done_mask from train_batch[SampleBatch.DONES].float(). train_batch is passed as an argument to build_q_losses, defined in line 221. It looks to me like train_batch is a dictionary where SampleBatch.REWARDS and SampleBatch.DONES are the keys, but how are the values computed?
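For reference, this is how I currently read the role of done_mask in the loss: it zeroes the bootstrap term wherever the episode terminated. A small sketch under that assumption (numpy stand-ins for the torch tensors, and the target-zeroing logic is my reading, not verified against the actual QLoss code):

```python
import numpy as np

gamma, n_step = 0.99, 1  # illustrative values

# Stand-ins for the tensors I believe come out of train_batch:
rewards = np.array([1.0, 0.5])        # train_batch[SampleBatch.REWARDS]
done_mask = np.array([0.0, 1.0])      # train_batch[SampleBatch.DONES].float()
q_tp1_best = np.array([10.0, 10.0])   # max_a Q(s_{t+n}, a) from the target net

# My reading: mask out the bootstrap where the episode ended,
# then form the TD target as reward + discounted bootstrap.
q_tp1_best_masked = (1.0 - done_mask) * q_tp1_best
q_target = rewards + gamma ** n_step * q_tp1_best_masked
print(q_target)  # → [10.9  0.5]
```

If that reading is correct, my remaining question is only about where the values behind SampleBatch.REWARDS and SampleBatch.DONES get filled in before the batch reaches the loss.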