Action masks and loss functions

S. Fang asked this question on our Slack channel.
Please do not use the Slack channel anymore for questions on RLlib! All discussions should be moved here for better searchability and documentation of issues and questions. Thank you.

Hi, I have a question about action masks and loss functions. I have an offline dataset of pre-generated episodes that I use for RL training by applying action masks that imitate the action sequences and states in the offline experiences. I’m using action masks because that seemed easier to implement in our complex web-application-based RL setup than using RLlib’s SampleBatch API.
However, the imitation learning isn’t having the effect I expected. Could this be because the action masks force the action distribution to assign all probability to a single action (the imitation action), which then also affects the loss calculation and therefore the gradients during backprop?

Not sure I understand your exact setup.
My first questions would be:

  • Which offline algorithm are you using? Pure behavior cloning (BCTrainer)?
  • What’s your action space?
  • Where exactly are you applying the masking? After the network output and before the loss calculation?
    If yes, then that could create issues as you may be obfuscating the parameterization of the action distribution output by your network (and making lots of useful gradients zero).
  • Also, am I understanding correctly that each mask only has one valid (discrete) action, which is given by the offline dataset?
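
To illustrate the concern about zeroed gradients, here is a minimal sketch in plain Python. It is not RLlib code. It assumes a discrete action space, a softmax policy head, a negative log-likelihood (behavior-cloning style) loss, and a mask applied by replacing invalid logits with a large negative constant. The helper names (softmax, nll_and_grad, apply_mask) and the example logits are made up for this illustration.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_and_grad(logits, action):
    # Negative log-likelihood of `action` under softmax(logits), plus its
    # gradient with respect to the logits, which is softmax(logits) - one_hot(action).
    probs = softmax(logits)
    loss = -math.log(probs[action])
    grad = [p - (1.0 if i == action else 0.0) for i, p in enumerate(probs)]
    return loss, grad


def apply_mask(logits, mask, neg=-1e9):
    # Hypothetical masking step: invalid actions get a large negative logit.
    return [x if valid else neg for x, valid in zip(logits, mask)]


logits = [0.5, -1.0, 2.0, 0.1]  # made-up network output for four actions
expert_action = 1               # action taken in the offline episode

# No mask: the loss is positive and the gradient pushes probability
# mass toward the expert action.
unmasked_loss, unmasked_grad = nll_and_grad(logits, expert_action)

# Mask that allows only the expert action, applied before the loss.
mask = [i == expert_action for i in range(len(logits))]
masked_loss, masked_grad = nll_and_grad(apply_mask(logits, mask), expert_action)

print("unmasked:", unmasked_loss, unmasked_grad)
print("masked:  ", masked_loss, masked_grad)
```

In the masked case the expert action already has probability 1, so the loss and every gradient entry are zero and the network receives no learning signal from that sample. If your masks really allow only one action and are applied before the loss, this would explain why the imitation has no effect.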