I want to customize RLlib’s DQN so that it outputs n (say, 10) sets of Q-values, where each set uses a different discount factor gamma that is also passed in as an input argument. I am trying to implement the architecture from this paper, shown on page 14, Figure 9. I have 2 questions:
Can I define a CustomModel class based on this RLlib example code to implement this architecture? Is this doable in a way that does not interfere with the rest of RLlib (which I am still learning and am no expert in)? I want to use a TF model.
What will happen to RLlib’s config["gamma"], given that I don’t want a fixed gamma value but rather want to pass a list of gammas (one per Q-value head) when the neural network is created? I am not sure how config["gamma"] will behave in this case.
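For concreteness, here is a rough sketch of the kind of multi-head model I have in mind (purely illustrative; the "gammas" key in custom_model_config is my own convention, not an RLlib option, and how to hook the n heads into the DQN policy is exactly what I am asking about):

```python
import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class MultiGammaQModel(TFModelV2):
    """Shared torso with one Q-value head per discount factor (sketch only)."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        # e.g. [0.9, 0.95, ..., 0.999], passed in when the network is created.
        self.gammas = model_config["custom_model_config"]["gammas"]

        obs_in = tf.keras.layers.Input(shape=obs_space.shape, name="obs")
        hidden = tf.keras.layers.Dense(512, activation="relu")(obs_in)
        # One (batch, num_actions) Q-output per gamma.
        q_heads = [
            tf.keras.layers.Dense(action_space.n, name=f"q_head_{i}")(hidden)
            for i in range(len(self.gammas))
        ]
        self.base_model = tf.keras.Model(obs_in, q_heads)
        self.register_variables(self.base_model.variables)
```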
I am thankful to the Ray team and would appreciate any help/pointers in this direction. Thank you.
Hi RLlib team (@sven1977, @ericl, others), I opened an issue on GitHub with a Colab for you to reproduce. I would appreciate it if you could take a look.
Hey @rfali , yeah, this should work with a custom Q-model (just sub-class the DQNTFModel and implement this logic). You’d also probably have to change the DQN loss function, though.
Just create a new DQNTFPolicy via:
MyDQNPolicy = DQNTFPolicy.with_updates(loss=[your own loss function]).
Gamma from the config is only used in the loss function and maybe the n-step postprocessing function, so you’d have to do either n_step=1 or also implement your own postprocessing function, like so:
MyDQNPolicy = DQNTFPolicy.with_updates(loss=[your own loss function], postprocess_fn=[your own postprocessing fn doing n-step with different gammas]).
For action choices, you also may have to specify your own action_distribution_fn:
MyDQNPolicy = DQNTFPolicy.with_updates(loss=[your own loss function], postprocess_fn=[your own postprocessing fn doing n-step with different gammas], action_distribution_fn=[your own action picking and action distribution fn]).
Here is the docstring:
action_distribution_fn (Optional[Callable[[Policy, ModelV2, TensorType,
TensorType, TensorType],
Tuple[TensorType, type, List[TensorType]]]]): Optional callable
returning distribution inputs (parameters), a dist-class to
generate an action distribution object from, and internal-state
outputs (or an empty list if not applicable). If None, will either
use `action_sampler_fn` or compute actions by calling self.model,
then sampling from the so parameterized action distribution.
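Putting the pieces above together, a rough skeleton might look like this. The function names and bodies are placeholders; the override keywords, as I understand them from build_tf_policy, are loss_fn, postprocess_fn, and action_distribution_fn, but double-check the exact names and signatures against your installed RLlib version:

```python
from ray.rllib.agents.dqn.dqn_tf_policy import DQNTFPolicy


def multi_gamma_loss(policy, model, dist_class, train_batch):
    # Placeholder: compute one TD loss per gamma head and aggregate them.
    raise NotImplementedError


def multi_gamma_postprocess(policy, sample_batch, other_agent_batches=None, episode=None):
    # Placeholder: 1-step (or per-gamma n-step) return computation.
    return sample_batch


def multi_gamma_action_distribution(policy, model, obs_batch, **kwargs):
    # Placeholder: decide which head(s) drive action selection and return
    # (distribution inputs, dist class, state outs), per the docstring above.
    raise NotImplementedError


MyDQNPolicy = DQNTFPolicy.with_updates(
    loss_fn=multi_gamma_loss,
    postprocess_fn=multi_gamma_postprocess,
    action_distribution_fn=multi_gamma_action_distribution,
)
```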
@sven1977 Thank you, that helps a lot in understanding how customizable RLlib is! For anyone else looking at this and wanting to do something similar, I found this and this example to be very helpful.
Following your suggestions, I think I am going to do the following (I want to use an APEX trainer with a custom model and a custom loss function). See here for more details.
Since my model returns multiple sets of Q-values (one per gamma), I will have to compute a loss for each set, aggregate the losses, and scale the result (divide by num_gammas).
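Concretely, I am thinking of something like the following for the loss (tensor names such as q_heads, target_q_heads, and gammas are my own placeholders, and I use a plain squared TD error rather than RLlib’s Huber loss just to keep the sketch short):

```python
import tensorflow as tf


def multi_gamma_loss(q_heads, target_q_heads, actions, rewards, dones, gammas, num_actions):
    """One 1-step TD loss per gamma head, averaged over the heads (illustrative only)."""
    one_hot = tf.one_hot(actions, num_actions)
    losses = []
    for q_t, q_tp1, gamma in zip(q_heads, target_q_heads, gammas):
        # Q(s, a) of the actions actually taken, for this head.
        q_t_selected = tf.reduce_sum(q_t * one_hot, axis=1)
        # 1-step target using this head's own discount factor.
        q_tp1_best = tf.reduce_max(q_tp1, axis=1)
        target = rewards + gamma * (1.0 - tf.cast(dones, tf.float32)) * q_tp1_best
        td_error = q_t_selected - tf.stop_gradient(target)
        losses.append(tf.reduce_mean(tf.square(td_error)))
    # Aggregate and scale by the number of gammas.
    return tf.add_n(losses) / len(gammas)
```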
For now I am going to modify the original dqn_tf_policy.py directly, so that there are as few moving pieces as possible that could break. Afterwards I may try this approach instead. Suggestions?
However, before I do all this, I am still stuck at the model output layer, where I receive the Q-values.
@sven1977 re: the above. What I am stuck on is that the custom model outputs not one set of Q-values but several (equal to the number of discount factors I want to use, e.g. 3 below). The complete callback is here and the custom model here on the GitHub issue (should we keep the discussion here or there?).
ValueError: Layer model expects 1 input(s), but it received 3 input tensors.
Now I know that this is exactly what my model outputs, but I am having trouble making my model interface with the compute_q_values function. It should be a simple fix, but I am lost (and a newbie). Should I:
Concatenate all the Q-value layers into one output? At the moment I am not doing this, because the original paper didn’t, and more importantly because I am not sure how the loss would then propagate.
Most importantly, what change should I make in order to receive a list of Q-value tensors? Should I change the shape in self.model_out, or make it a List[TensorType] input to get_q_value_distributions()? (I have tried and failed with both approaches.) I know this function outputs the action_scores at this place, and once I have those multiple Q-value tensors my life will be easier.
Closing this issue. I had to subclass my custom model from TFModelV2 instead of DistributionalQTFModel, as config["num_outputs"] was interfering with my code.
One thing that added to the confusion is that the model output for DQN is usually of size num_actions, but in RLlib’s default setup it is num_outputs, which is the size of the last hidden layer (512 in the classic DQN literature). RLlib achieves modularity between conv nets and MLPs through the config settings, which may not be obvious to a new user.
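For anyone who lands here later, a simplified version of the pattern that ended up working for me looks roughly like this (the "gammas" key in custom_model_config and the get_multi_gamma_q_values method are my own names, wired into my modified compute_q_values and loss; this is a sketch, not a drop-in model):

```python
import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class MultiGammaDQNModel(TFModelV2):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        gammas = model_config["custom_model_config"]["gammas"]

        # Shared torso: forward() only returns this embedding of size
        # num_outputs (e.g. 512), matching RLlib's convention described above.
        obs_in = tf.keras.layers.Input(shape=obs_space.shape, name="obs")
        embed = tf.keras.layers.Dense(num_outputs, activation="relu")(obs_in)
        self.base_model = tf.keras.Model(obs_in, embed)

        # One Q-head per gamma, applied to the embedding outside of forward().
        embed_in = tf.keras.layers.Input(shape=(num_outputs,), name="model_out")
        q_outs = [
            tf.keras.layers.Dense(action_space.n, name=f"q_head_{i}")(embed_in)
            for i in range(len(gammas))
        ]
        self.q_model = tf.keras.Model(embed_in, q_outs)

        # May be required on older RLlib versions for variable tracking.
        self.register_variables(self.base_model.variables + self.q_model.variables)

    def forward(self, input_dict, state, seq_lens):
        # Shape (batch, num_outputs); no Q-values are emitted here.
        return self.base_model(input_dict["obs"]), state

    def get_multi_gamma_q_values(self, model_out):
        # List of (batch, num_actions) tensors, one per gamma.
        return self.q_model(model_out)
```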