While checking that everything works fine in my environment, I noticed that the training batch in APPO (torch) also contains observations for non-acting agents (all zeros), and therefore also computes actions/vf_preds for them.
Now my question:
Should I make sure that no optimization is done on those samples, e.g. by detaching the gradients for those actions?
Is this even intended behavior? Do the reported stats take those “fake” trajectories into account? I fully understand that this is easier for implementation reasons, because the batch shapes always stay the same.
I thought I’d ask this before digging through the APPO code.
Any help appreciated.
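In case it helps others reading along: one way to keep those all-zero samples from contributing to the optimization, without touching RLlib internals, is to mask them out inside a custom loss. This is only a sketch under the assumption that a non-acting agent’s observation row is exactly all zeros; `masked_mean` is a hypothetical helper, not an RLlib API:

```python
import torch

def masked_mean(loss_per_sample, obs):
    # Hypothetical helper: treat rows whose observation is all zeros
    # as padding from non-acting agents and exclude them from the mean.
    # loss_per_sample: (B,) tensor, obs: (B, obs_dim) tensor.
    valid = (obs.abs().sum(dim=-1) > 0).float()  # 1.0 for real samples
    # Clamp the denominator so an all-padding batch doesn't divide by zero.
    return (loss_per_sample * valid).sum() / valid.sum().clamp(min=1.0)
```

With this, the padded rows still flow through the forward pass (so shapes stay the same), but their gradients never reach the optimizer.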
Hey @Sertingolix, not sure, but the zeros you are seeing could simply be the initial dummy batch that RLlib passes through your loss function.
Can you confirm that you are only seeing those during the first loss pass in each of your policies? Note that you should see this once for each remote worker plus the local worker, since each of them holds a copy of the policy.
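To check that, you could wrap your loss function with a small counter and log the first pass per policy. A minimal sketch, assuming a torch policy loss with RLlib’s usual `(policy, model, dist_class, train_batch)` signature; `counting_loss_wrapper` and `loss_call_counts` are names I made up for illustration:

```python
# Count loss passes per policy to see whether the all-zero samples
# appear only in the very first (dummy) pass on each worker.
loss_call_counts = {}

def counting_loss_wrapper(policy_id, base_loss):
    def loss(policy, model, dist_class, train_batch):
        loss_call_counts[policy_id] = loss_call_counts.get(policy_id, 0) + 1
        if loss_call_counts[policy_id] == 1:
            print(f"first (possibly dummy) loss pass for {policy_id}")
        return base_loss(policy, model, dist_class, train_batch)
    return loss
```

If the zeros show up beyond pass one (times the number of workers), they are coming from somewhere other than the dummy initialization.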
The zeros persist after the initial dummy batches are processed. I actually get correct/real experience from the environment in proportion to the number of acting agents, and all-zero samples otherwise. I also don’t use replay, which could otherwise keep samples in the training loop longer.
Although I don’t think it should matter, I use a Repeated space in the observation. Just thinking of it now.
I haven’t gone through the code yet, because I was able to reduce the environment to an equivalent one in which all agents act, and that one trains successfully.
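For anyone wanting to verify the ratio of real to padded samples in their own batches, counting the all-zero rows is straightforward. A sketch with plain numpy, again under the assumption that padding rows are exactly all zeros (`count_padding_rows` is a hypothetical helper):

```python
import numpy as np

def count_padding_rows(obs_batch):
    # Count rows that are entirely zero -- in this setup those
    # correspond to the non-acting agents' "fake" samples.
    obs_flat = obs_batch.reshape(len(obs_batch), -1)
    return int((np.abs(obs_flat).sum(axis=1) == 0).sum())
```

Comparing this count against the number of non-acting agent steps in the episode should tell you whether the zeros are exactly the padded trajectories or something else.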