S. Fang asked this question on our Slack channel.
Please do not use the Slack channel for RLlib questions anymore! All discussions should move here so that issues and questions remain searchable and documented. Thank you.
Hi, I have a question about action masks and loss functions. I currently have an offline dataset of pre-generated episodes that I use for RL training by applying action masks that imitate the action sequences and states from the offline experiences. I'm using action masks because they seemed easier to implement in our complex web-application-based RL setup than RLlib's SampleBatch API.
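For context, the masking follows the usual pattern of adding a very large negative value to the logits of every disallowed action before the softmax, so only the action taken in the offline episode keeps any probability. A minimal standalone sketch in plain PyTorch (illustrative values only, not my actual model code):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of hard action masking: add a very large negative value to the
# logits of every disallowed action so the softmax collapses onto the single
# allowed (imitation) action. The numbers are made up for illustration.
MASK_VALUE = torch.finfo(torch.float32).min  # effectively -inf for the softmax

logits = torch.tensor([[0.3, -1.2, 0.8, 0.1]])      # raw policy output
action_mask = torch.tensor([[0.0, 0.0, 1.0, 0.0]])  # 1 = the imitation action

masked_logits = logits + (1.0 - action_mask) * MASK_VALUE
probs = F.softmax(masked_logits, dim=-1)
print(probs)  # ~[0, 0, 1, 0]: all probability mass on the allowed action
```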
However, the imitation learning isn't having the effect I'm expecting. Could it be that applying action masks, which essentially force the action probability distribution to put all probability on a single action (the imitation action), also affects the loss calculation, and therefore backprop and the gradients?
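To make the concern concrete, here is a rough numerical check of what I mean (again a standalone PyTorch sketch, not my training code): with the hard mask in place, the log-prob of the forced action is ~0 and its gradient w.r.t. the raw logits is ~0, so a policy-gradient style term such as `-logp * advantage` would barely change the policy weights.

```python
import torch
import torch.nn.functional as F

# Rough check of the suspicion above: once the distribution is collapsed onto
# one action, the log-prob of that action is ~0 and so is its gradient w.r.t.
# the raw logits, so a policy-gradient style loss gets almost no signal.
MASK_VALUE = torch.finfo(torch.float32).min

logits = torch.tensor([[0.3, -1.2, 0.8, 0.1]], requires_grad=True)
action_mask = torch.tensor([[0.0, 0.0, 1.0, 0.0]])  # only the imitation action allowed

masked_logits = logits + (1.0 - action_mask) * MASK_VALUE
logp = F.log_softmax(masked_logits, dim=-1)[0, 2]   # log-prob of the forced action

logp.backward()
print(logp.item())   # ~0.0
print(logits.grad)   # ~[[0, 0, 0, 0]] -> essentially no gradient into the policy
```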