Is Rainbow/DQN really usable with parametric action spaces?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.


I’m mostly asking these questions to make sure I understand the warning / the documentation properly.

  1. According to the parametric action spaces section in the documentation, DQN can only be used with hiddens: [] inside the trainer config. So that basically means there are no hidden layers at all in the neural network used by DQN, correct? If so, is DQN really usable with parametric action spaces, given that we can’t define a “true” NN?

  2. Similarly, having to set dueling to False inside the trainer config, as per the cartpole example with masked actions, means it is impossible to use Rainbow DQN on parametric action spaces, correct?
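For reference, the two settings mentioned in the questions above could be written as the following config fragment. This is only a sketch: the keys `hiddens` and `dueling` are the ones named in this thread, the variable name `parametric_dqn_config` is made up for illustration, and any other keys a real trainer needs (environment, custom model, and so on) are left out.

```python
# Sketch of the two DQN settings discussed above (not a complete trainer config).
# The variable name is hypothetical; only the two keys come from the thread.
parametric_dqn_config = {
    "hiddens": [],     # no extra hidden layers configured through this key
    "dueling": False,  # dueling head disabled, as in the masked-actions example
}

# These overrides would typically be merged into a larger config dict.
base_config = {"framework": "torch"}  # assumed placeholder value
trainer_config = {**base_config, **parametric_dqn_config}
```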

Perhaps I am missing something, but I can’t gather more from what I’m reading in the docs. If I am mistaken, please tell me what I am misunderstanding!

Thanks for the question.

I believe these limitations only apply to the specific models used in the example script (ParametricActionsModel or TorchParametricActionsModel).

If you write your own custom model by following the documentation you linked to, you should be able to use any kind of NN you like, as long as it understands the input obs format.

I thought so too, but I think I have delved pretty deep into the API, and there seems to be no effective way to use dueling and Rainbow with -inf masks. The only way I’ve found to actually use parametric action spaces is to add an attribute that stores a mask in a custom DQNTorchModel, then modify RLlib’s compute_q_values function so that it fetches the mask from the DQNTorchModel and applies it to the value variable after the computations, before the values are sent to the policy.
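The idea of applying the mask after the Q-values have been computed can be sketched in a framework-free way as follows. This is not RLlib code: `mask_q_values` is a hypothetical helper, and the array shapes (batch, actions) are assumptions chosen for illustration.

```python
import numpy as np

def mask_q_values(q_values, action_mask):
    """Set Q-values of invalid actions to -inf so argmax never selects them.

    Hypothetical helper for illustration; not part of RLlib.
    q_values:    array of shape (batch, num_actions), already fully computed.
    action_mask: same shape, 1/True for valid actions, 0/False for invalid ones.
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    action_mask = np.asarray(action_mask, dtype=bool)
    return np.where(action_mask, q_values, -np.inf)

# Two states, three actions each.
q = np.array([[0.2, 1.5, -0.3],
              [0.9, 0.1, 0.4]])
mask = np.array([[1, 0, 1],
                 [0, 1, 1]])

masked = mask_q_values(q, mask)
greedy_actions = masked.argmax(axis=-1)  # greedy choice among valid actions only
```

Because the mask is applied as the very last step, the upstream Q-value computation only ever sees finite numbers.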

The reason it isn’t as straightforward as you suggest, I think, is that distributional DQN needs the mask applied only at the end of the Q-value computation, not inside the DQN model on the distributions directly. There seem to be several reasons for this; notably, the computations inside compute_q_values, specifically the sections related to distributional DQN, don’t mix well with a -inf mask, which is what the policy requires in order to ignore actions. It has been a while since I last checked that code, but I think it was the value variable’s computation that led to errors on my end.
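To illustrate why a -inf mask can misbehave when it is applied before the final Q-values exist, here is a small NumPy sketch. It is not taken from RLlib and may not be the exact failure described above; it uses the textbook dueling aggregation Q = V + (A - mean(A)) and a categorical (distributional) head whose Q-value is the expectation over a fixed support. All function names are hypothetical.

```python
import numpy as np

def dueling_q(value, advantages):
    """Textbook dueling aggregation: Q = V + (A - mean(A))."""
    return value + advantages - advantages.mean()

def softmax(logits):
    """Softmax over the last axis (the atoms of each action's distribution)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def distributional_q(logits, support):
    """Expected value of each action's categorical return distribution."""
    return (softmax(logits) * support).sum(axis=-1)

mask = np.array([True, False, True])  # action 1 is invalid

# Dueling: masking the advantages first drags the mean to -inf.
advantages = np.array([1.0, 2.0, 3.0])
masked_adv = np.where(mask, advantages, -np.inf)
with np.errstate(invalid="ignore"):  # silence the expected NaN warnings
    broken_dueling = dueling_q(0.5, masked_adv)  # no finite entries remain

# Distributional: masking every atom of an action makes its softmax undefined.
support = np.linspace(-1.0, 1.0, 4)  # 4 atoms
logits = np.zeros((3, 4))            # 3 actions, uniform distributions
masked_logits = np.where(mask[:, None], logits, -np.inf)
with np.errstate(invalid="ignore"):
    broken_dist = distributional_q(masked_logits, support)  # NaN for action 1

# Computing first and masking last keeps everything well defined.
safe_q = np.where(mask, distributional_q(logits, support), -np.inf)
```

In this sketch the early mask yields inf or NaN values, while masking the finished Q-values leaves the valid actions finite and the invalid one at -inf.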

I see. That sounds totally plausible. You have probably looked at this specific problem a lot more than any of us :)
Would you be OK with creating a custom policy that inherits from the DQN policy and overrides compute_q_values() so that it understands masks?
I’d actually invite you to create a PR if you can get this working, and we can make it an awesome example for the community.
If you run into any blocking issues, please also share what you have. It’s much easier to understand a particular issue if we have a repro script.

I have made a fork and made my changes there. I also asked for feedback on this forum a couple of days ago. (link to thread: [Contribution] [Help needed] Implementing easy action masking for distributional and dueling DQN). Perhaps we can talk about it more there?