Hi fedetask. Reviving a bit of an older thread here, but I wanted to know if you found an answer to the exact problem you write about in this message. I am also working on a problem where I need invalid actions to be masked out (which my internal model already handles), but I do still need to apply noisy q_hiddens layers afterwards to get Distributional-Q values.
If I do the masking in the internal model, as I am doing now, and then feed that output (model_out) into q_hiddens, I will essentially be passing -inf values into the q_hiddens layers. Those will most likely turn into NaNs or infinities there (for example, -inf multiplied by a zero weight is NaN). So did you instead apply the -inf masks to the output of the Distributional-Q model rather than the internal model? Or did you do some masking in both the internal model and the outer model? Or would it be enough to set the masked-out action outputs to 0 or tf.float32.min instead of -inf?
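To make the -inf concern concrete, here is a minimal sketch, written in NumPy as a stand-in for the TensorFlow ops. The helper names (dense, mask_q_values), the weights and the mask are all hypothetical. It shows that a -inf feature pushed through a single linear layer yields NaN and ±inf. It then shows the alternative of keeping the hidden layers finite and replacing only the final per-action Q-values of invalid actions with the smallest finite float32 (the same value as tf.float32.min). For a distributional head, the assumption here is that the mask would be applied to the per-action expected Q-values (the sum over atoms of z times p) used for the argmax, not to the per-atom logits.

```python
import numpy as np

FLOAT32_MIN = np.finfo(np.float32).min  # same value as tf.float32.min


def dense(x, w, b):
    """Hypothetical linear layer standing in for one q_hiddens Dense layer.

    Uses broadcasting instead of `x @ w` so every product x[i] * w[i, j]
    is actually computed (some BLAS backends skip zero weights).
    """
    return np.sum(x[:, :, None] * w[None, :, :], axis=1) + b


def mask_q_values(q_values, action_mask):
    """Replace Q-values of invalid actions (mask == 0) with the smallest finite float32."""
    return np.where(action_mask.astype(bool), q_values, FLOAT32_MIN).astype(np.float32)


# Made-up weights for a 3-feature -> 3-action layer.
w = np.array([[0.2, -0.1, 0.4],
              [0.0, 0.3, -0.2],
              [0.5, 0.1, 0.3]], dtype=np.float32)
b = np.zeros(3, dtype=np.float32)

# Case 1: the internal model already put -inf into its output (model_out).
features_with_inf = np.array([[1.0, -np.inf, 0.5]], dtype=np.float32)
with np.errstate(invalid="ignore"):  # silence the -inf * 0 warning
    hidden_bad = dense(features_with_inf, w, b)
# Column 0: -inf * 0.0 = nan; column 1: -inf * 0.3 = -inf; column 2: -inf * -0.2 = +inf.
print("hidden with -inf input:", hidden_bad)

# Case 2: keep model_out finite and mask only the final per-action Q-values.
features_ok = np.array([[1.0, 0.0, 0.5]], dtype=np.float32)
q_values = dense(features_ok, w, b)  # shape [batch, num_actions]
action_mask = np.array([[1, 1, 0]])  # action 2 is invalid
masked_q = mask_q_values(q_values, action_mask)
print("unmasked argmax:", q_values.argmax(axis=1))
print("masked argmax:  ", masked_q.argmax(axis=1))
```

Using the finite float32 minimum instead of -inf keeps any later arithmetic (for example a TD-error or a softmax) finite. np.where replaces the values outright instead of adding the minimum, so a large negative Q-value cannot overflow to -inf.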
@arturn Do you have any advice on this?