Hello. I have been trying to set up action masking for a model I am working on. For testing purposes, I am just trying to get any kind of action masking working using the action_mask_model.
I am using a Discrete(50) observation space and a Discrete(50) action space. At each training step I take the current step % 50 and use that as the observation value and, in turn, as the value I want returned as the action (e.g. step 101 → observation value 1 and expected action value 1).
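For concreteness, the mapping described above can be sketched as follows. This is a minimal illustration rather than the actual environment code: `NUM_ACTIONS`, `observation_for_step`, and `action_mask_for_step` are hypothetical names, and the one-hot mask is an assumption based on the example mask shown later in this post.

```python
from typing import List

NUM_ACTIONS = 50  # matches the Discrete(50) observation and action spaces


def observation_for_step(step: int) -> int:
    """Observation value is the current step modulo 50."""
    return step % NUM_ACTIONS


def action_mask_for_step(step: int) -> List[int]:
    """One-hot mask: only the action equal to the observation is allowed."""
    target = observation_for_step(step)
    return [1 if action == target else 0 for action in range(NUM_ACTIONS)]


obs = observation_for_step(101)
mask = action_mask_for_step(101)
print(obs, mask[:5])  # prints: 1 [0, 1, 0, 0, 0]
```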
Other info: I am using PPO and the tf version of the action mask model.
I have used print statements to confirm that it is definitely producing the correct observation value and action mask {0, 1, 0, 0, 0, … 0, 0}. However, the returned action is still always a random value between 0 and 49.
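To make the expected behaviour explicit, here is a small sketch of what masking is supposed to do to the action distribution. It assumes the common approach of adding a very large negative number to the logits of disallowed actions before sampling; whether action_mask_model does exactly this is an assumption here, and `FLOAT_MIN`, `mask_logits`, `softmax`, and `sample_action` are hypothetical names, not RLlib APIs.

```python
import math
import random
from typing import List

FLOAT_MIN = -3.4e38  # assumed stand-in for the most negative float32 value


def mask_logits(logits: List[float], mask: List[int]) -> List[float]:
    """Add a huge negative number to every logit whose mask entry is 0."""
    return [l + (0.0 if m == 1 else FLOAT_MIN) for l, m in zip(logits, mask)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; masked entries underflow to probability 0."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits: List[float], mask: List[int], rng: random.Random) -> int:
    """Sample one action index from the masked distribution."""
    probs = softmax(mask_logits(logits, mask))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [rng.uniform(-1.0, 1.0) for _ in range(50)]  # arbitrary raw logits
mask = [1 if a == 1 else 0 for a in range(50)]        # only action 1 allowed
samples = [sample_action(logits, mask, rng) for _ in range(1000)]
print(set(samples))  # prints: {1}
```

With a one-hot mask like this, every sample should be the single allowed action, which is what I expected to see and did not.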
I initially tested a version where, regardless of the step and observation, the mask always forced the action to be 1, and this worked. I do not understand why adding variation in the mask and observation would make a difference.
Side note: I have action_mask_model.py saved locally with some print statements added to the forward method so I could try to debug why this is happening. For some reason, the print statements only show up for the first 3 calls of forward during the build of the policy and then never again. I have even tried adding a global variable that increments each time forward runs and raising an exception when it exceeds 3, and this never gets triggered. I have been unable to understand why this is happening either.