[RLlib] Multi-headed DQN

Hi,

I want to customize RLlib’s DQN so that it outputs n (let’s say 10) sets of Q-values, where each set uses a different discount factor gamma that is passed in as an input argument. I am trying to implement the architecture from this paper, shown in Figure 9 on page 14. I have 2 questions:

  1. Can I define a CustomModel class based on this RLlib example code that implements this architecture? Is this doable in a way that does not interfere with the rest of RLlib (which I am still learning and am no expert in)? I want to use a TF model.
  2. What will happen to RLlib’s config["gamma"]? I don’t want a single fixed gamma value; rather, I want to pass a list of gammas (one for each set of Q-values) when the neural network is created. I am not sure how config["gamma"] will behave in this case. (A rough sketch of what I mean follows below.)
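
To make question 2 concrete, what I have in mind is passing the list of gammas to the custom model through the model config, roughly like this (just a sketch; the model name and the "gammas" key are my own, and in older RLlib versions this key is custom_options instead of custom_model_config, if I understand correctly):

config = {
    "model": {
        "custom_model": "multi_gamma_q_model",   # my hypothetical registered model name
        "custom_model_config": {
            "gammas": [0.99, 0.97, 0.95],        # one set of Q-values per gamma
        },
    },
    # Question 2 above: what does RLlib do internally with this single gamma
    # (loss / n-step postprocessing) once the model uses its own list?
    "gamma": 0.99,
}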

I am thankful to the Ray team and would appreciate any help/pointers in this direction. Thank you.

Hi RLlib team (@sven1977, @ericl, others), I opened an issue on GitHub with a Colab so you can reproduce it. I would appreciate it if you could take a look.

Hey @rfali , yeah, this should work with a custom Q-model (just sub-class the DQNTFModel and implement this logic). You’d also probably have to change the DQN loss function, though.

Just create a new DQNTFPolicy via:

MyDQNPolicy = DQNTFPolicy.with_updates(loss_fn=[your own loss function])

Gamma from the config is only used in the loss function and maybe the n-step postprocessing function, so you’d have to either set n_step=1 or also implement your own postprocessing function, like so:

MyDQNPolicy = DQNTFPolicy.with_updates(loss_fn=[your own loss function], postprocess_fn=[your own postprocessing fn doing n-step with different gammas])

For action choices, you also may have to specify your own action_distribution_fn:

MyDQNPolicy = DQNTFPolicy.with_updates(loss_fn=[your own loss function], postprocess_fn=[your own postprocessing fn doing n-step with different gammas], action_distribution_fn=[your own action picking and action distribution fn])

Here is the docstring:

        action_distribution_fn (Optional[Callable[[Policy, ModelV2, TensorType,
            TensorType, TensorType],
            Tuple[TensorType, type, List[TensorType]]]]): Optional callable
            returning distribution inputs (parameters), a dist-class to
            generate an action distribution object from, and internal-state
            outputs (or an empty list if not applicable). If None, will either
            use `action_sampler_fn` or compute actions by calling self.model,
            then sampling from the so parameterized action distribution.
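
A custom action_distribution_fn could then look something like this (rough sketch only; get_multi_gamma_q_values would be your own model method returning one [B, num_actions] tensor per gamma, and taking the mean over gammas is just one possible way to pick actions):

import tensorflow as tf
from ray.rllib.models.tf.tf_action_dist import Categorical

def my_action_distribution_fn(policy, model, obs_batch, **kwargs):
    # Run the model to get the shared features (in this sketch, the custom
    # model also caches its per-gamma Q-heads internally during this call).
    model_out, _ = model({"obs": obs_batch}, [], None)
    # Placeholder for your own model method: a list of [B, num_actions] tensors.
    q_per_gamma = model.get_multi_gamma_q_values()
    # E.g. pick actions from the mean Q-values over all gammas.
    action_scores = tf.reduce_mean(tf.stack(q_per_gamma, axis=0), axis=0)
    # Return: distribution inputs, dist class, internal-state outs.
    return action_scores, Categorical, []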

@sven1977 Thank you, that helps a lot in understanding how customizable RLlib is!! For anyone else looking at this and wanting to do something similar, I found this and this example to be very helpful.

Following your advice, I think I am going to do the following (I want to use an APEX trainer with a custom model and a custom loss function). See here for more details.

from ray.rllib.agents.dqn import DQNTFPolicy
from ray.rllib.agents.dqn.apex import ApexTrainer

def custom_build_q_losses(policy, model, dist_class, train_batch):
    # custom code: one TD loss per gamma head, aggregated (see below)
    ...

def custom_postprocess_nstep_and_prio(policy, batch, other_agent_batches=None, episode=None):
    # custom code: n-step returns with a different gamma per Q-head
    ...

def custom_get_distribution_inputs_and_class(policy, model, obs_batch, **kwargs):
    # custom code: compute the (multi-gamma) Q-values used for action selection
    ...

MyDQNPolicy = DQNTFPolicy.with_updates(
    loss_fn=custom_build_q_losses,                                    # compute q-loss
    postprocess_fn=custom_postprocess_nstep_and_prio,                 # adjust n-step returns
    action_distribution_fn=custom_get_distribution_inputs_and_class,  # compute q-values
)

CustomTrainer = ApexTrainer.with_updates(
    name="MyAPEXTrainer", get_policy_class=lambda _: MyDQNPolicy)

Since my model returns multiple sets of Q-values (one for each gamma), I will have to compute a loss for each Q-value head, aggregate the losses, and scale them (divide by num_gammas).
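
Concretely, the aggregation I have in mind is something like this (a sketch only; the tensors here stand in for what the real loss function would pull out of the train batch and the target network):

import tensorflow as tf

def multi_gamma_td_loss(q_t_selected_per_gamma, q_tp1_best_per_gamma,
                        rewards, dones, gammas):
    # One TD loss per gamma head, then the mean over heads
    # (i.e. the summed per-head losses divided by num_gammas).
    losses = []
    for q_t_selected, q_tp1_best, gamma in zip(
            q_t_selected_per_gamma, q_tp1_best_per_gamma, gammas):
        target = rewards + gamma * (1.0 - dones) * q_tp1_best
        td_error = q_t_selected - tf.stop_gradient(target)
        losses.append(tf.reduce_mean(0.5 * tf.square(td_error)))
    return tf.add_n(losses) / len(gammas)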

For now, I am still going to modify the original dqn_tf_policy.py directly, so that there are as few moving pieces as possible that could break. Then I will perhaps try this approach. Suggestions?

However, before I do all this, I am still stuck on the model’s output layer, where I receive the Q-values :confused:

@sven1977 re: the above. What I am stuck on is that the custom model outputs not one set of Q-values but multiple (equal to the number of discount factors I want to use, e.g. 3 below). The complete callback is here and the custom model is here on the GitHub issue (should we keep the discussion here or there?).

ValueError: Layer model expects 1 input(s), but it received 3 input tensors.

Now I know that this is exactly what my model outputs, but I am having trouble making my model interface with the compute_q_values function. It should be a simple fix, but I am lost (and a newbie). Should I:

  1. Concatenate all the Q-value layers into one and output that? At the moment I am not doing this, because the original paper didn’t, and more importantly because I am not sure how the loss would propagate then.
  2. Most importantly, what change should I make to receive a list of Q-value tensors? Should I change the shape in self.model_out, or make it a List[TensorType] input in get_q_value_distributions()? (I have tried and failed with both approaches.) I know this function outputs the action_scores at this place, and once I have those multiple Q-value tensors, my life will be easier. (See the sketch after this list.)
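
To make option 1 concrete, this is what I understand stacking would look like (a toy sketch with dummy tensors; since stack/unstack is differentiable, each per-gamma loss should still propagate back into its own head):

import tensorflow as tf

B, num_actions, num_gammas = 32, 4, 3
# Stand-ins for the outputs of the per-gamma Q-heads of my custom model:
q_per_gamma = [tf.random.uniform([B, num_actions]) for _ in range(num_gammas)]

# Option 1: stack into a single tensor, so code expecting one model output still works.
action_scores = tf.stack(q_per_gamma, axis=1)     # shape [B, num_gammas, num_actions]

# ...and split back into a list wherever the per-gamma losses are computed.
q_list_again = tf.unstack(action_scores, axis=1)  # list of num_gammas [B, num_actions] tensors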

Would appreciate your response. Thank you

Closing this issue. I had to subclass my custom model from TFModelV2 instead of DistributionalQTFModel, as config["num_outputs"] was interfering with my code.

One thing that added to the confusion is that the output of a DQN model is usually of size num_actions, but with RLlib’s default setup it is of size num_outputs, which is the size of the last hidden layer (512 in the classic DQN literature). RLlib keeps conv nets and MLPs modular through the config settings, which may not be obvious to a new user.
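
For anyone who lands here later, the model I ended up with looks roughly like this (a sketch only, with my own names and a plain MLP torso for illustration; the gammas would come in via custom_model_config, and the key point is that num_outputs is the shared feature size while each Q-head has num_actions outputs):

import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class MultiGammaQModel(TFModelV2):
    """Sketch: one shared torso, one Q-head (num_actions outputs) per gamma."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name,
                 gammas=(0.99, 0.97, 0.95)):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        self.gammas = list(gammas)

        inputs = tf.keras.layers.Input(shape=obs_space.shape, name="obs")
        hidden = tf.keras.layers.Dense(256, activation="relu")(inputs)
        # num_outputs here is the size of the shared feature layer (e.g. 512),
        # NOT the number of actions.
        features = tf.keras.layers.Dense(num_outputs, activation="relu",
                                         name="features")(hidden)
        # One Q-head of size num_actions per discount factor.
        q_heads = [
            tf.keras.layers.Dense(action_space.n, name="q_head_{}".format(i))(features)
            for i in range(len(self.gammas))
        ]
        self.base_model = tf.keras.Model(inputs, [features] + q_heads)
        self.register_variables(self.base_model.variables)

    def forward(self, input_dict, state, seq_lens):
        outputs = self.base_model(input_dict["obs"])
        self._features, self._q_heads = outputs[0], list(outputs[1:])
        # RLlib only sees the shared features; the Q-heads are fetched
        # separately via get_multi_gamma_q_values() in the custom policy code.
        return self._features, state

    def get_multi_gamma_q_values(self):
        # List of [B, num_actions] tensors, one per gamma.
        return self._q_heads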