Training mean reward vs. evaluation mean reward

Hi @SVH,

If you train with a stochastic policy, you should expect the best performance when you also run inference and evaluation with a stochastic policy. You should keep explore=True.
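To make the difference concrete, here is a small standalone sketch (plain Python, not RLlib code) of what I understand the two modes to do for a categorical policy: sampling from the action distribution versus always taking the most probable action. The select_action helper and the probabilities are made up for illustration, and the exact behaviour with explore=False depends on the exploration config you use.

```python
import random


def select_action(probs, explore, rng):
    """Pick an action index from a categorical action distribution."""
    if explore:
        # Stochastic: sample in proportion to the policy's probabilities.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Deterministic: always take the most probable action.
    return max(range(len(probs)), key=lambda a: probs[a])


rng = random.Random(0)
probs = [0.6, 0.4]  # hypothetical policy output for a single observation

stochastic = [select_action(probs, True, rng) for _ in range(1000)]
greedy = [select_action(probs, False, rng) for _ in range(1000)]

# Fraction of steps on which action 1 was chosen in each mode.
frac_action_1_stochastic = sum(stochastic) / len(stochastic)
frac_action_1_greedy = sum(greedy) / len(greedy)
print(frac_action_1_stochastic, frac_action_1_greedy)
```

The greedy version never picks action 1, so the evaluation episodes can visit different states than the ones the policy saw during training. That is one way the training and evaluation mean rewards can drift apart.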

I am not sure whether you have any preprocessors, but I think I remember @arturn saying that preprocessors are applied with compute_single_action but not with compute_actions.