Hey @lucas_spangher, thanks for the question. You are right, it's not super intuitive how to get there. Action logits are processed inside the Policy.exploration object. PPO uses the StochasticSampling exploration module under:
What you should do, if you want to modify these before the sampling step, is to subclass StochasticSampling and override its get_exploration_action method: apply your modification to the logits, make sure the sampling is done on those modified logits, then return the actions (just like the parent StochasticSampling does).
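To make the pattern concrete, here is a rough, framework-free sketch. It deliberately does not import RLlib: `Categorical` and `StochasticSamplingStandIn` are hypothetical stand-ins I wrote for illustration, and the keyword arguments (`action_distribution`, `timestep`, `explore`), the `inputs` attribute holding the logits, and the `masked_actions` argument are assumptions, so please check them against the StochasticSampling source in your RLlib version (the real method may also return more than just the action).

```python
import math
import random


class Categorical:
    """Hypothetical stand-in for an action distribution; `inputs` holds the logits."""

    def __init__(self, inputs):
        self.inputs = list(inputs)

    def probs(self):
        # Numerically stable softmax over the logits.
        m = max(self.inputs)
        exps = [math.exp(x - m) for x in self.inputs]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self):
        return random.choices(range(len(self.inputs)), weights=self.probs())[0]

    def deterministic_sample(self):
        return max(range(len(self.inputs)), key=lambda i: self.inputs[i])


class StochasticSamplingStandIn:
    """Hypothetical stand-in for the parent class: sample if exploring, else argmax."""

    def get_exploration_action(self, *, action_distribution, timestep, explore=True):
        if explore:
            return action_distribution.sample()
        return action_distribution.deterministic_sample()


class MyExploration(StochasticSamplingStandIn):
    """Masks some actions by editing the logits before the sampling step."""

    def __init__(self, masked_actions=()):
        # `masked_actions` is a made-up constructor argument for this example.
        self.masked_actions = set(masked_actions)

    def get_exploration_action(self, *, action_distribution, timestep, explore=True):
        # 1) Modify the logits (here: push masked actions to -inf).
        #    At least one action must stay unmasked for the softmax to be valid.
        logits = [
            -math.inf if i in self.masked_actions else x
            for i, x in enumerate(action_distribution.inputs)
        ]
        # 2) Rebuild a distribution of the same type from the modified logits.
        modified = type(action_distribution)(logits)
        # 3) Let the parent do the sampling on the modified distribution.
        return super().get_exploration_action(
            action_distribution=modified, timestep=timestep, explore=explore
        )
```

The key design point is step 3: delegating to the parent keeps the explore / no-explore handling in one place, so only the logits modification lives in your subclass.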
Then in your config, just do:
type: [full path to your new class, e.g. "my_dir.my_exploration.MyExploration"]
[other c'tor args for your class]
This tells PPO to use your own exploration class instead of the default one.
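If you build the config as a Python dict, it could look roughly like the sketch below. I am assuming here that the `type` entry sits under the `exploration_config` key of the trainer config; the module path is the example path from above, and `masked_actions` is a made-up constructor argument standing in for whatever your class actually takes.

```python
# Sketch of a trainer config that swaps in a custom exploration class.
config = {
    "exploration_config": {
        # Full import path to your new class (example path, adjust to yours).
        "type": "my_dir.my_exploration.MyExploration",
        # Any other keys are passed on as constructor args of your class
        # (`masked_actions` is hypothetical).
        "masked_actions": [0],
    },
}
```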