# Scaling rewards depending on action distribution

I’m training an agent to trade in a market. The agent takes a binary action at every step (buy or sell). During training the agent tends to choose one of the actions all of the time (either buy and hold, or sell and stay in cash). This results in a reward close to 0. It seems the agent doesn’t explore the environment for better policies, and I know a better policy exists because I trained one with a more conventional modeling tool (i.e. one that buys and sells frequently and turns a profit).

Is there a common name for this kind of problem, where the agent avoids losing by never taking any risks?

Question 1: How can I help the agent explore the environment better? I’ve been thinking about penalizing the reward when the action distribution is skewed to either side. Are there other ways?

Question 2: If I’m going for the reward-penalty option, what API is most suitable to implement this? I.e. if `acs` is a dict with action counts, `{0: 100, 1: 10}`, I want to decrease the reward by a quantity like `(100^2 + 10^2) / (100 + 10)^2`.
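The formula above can be sketched as a small standalone function. This is a Herfindahl-style concentration measure, not an RLlib API; the names `concentration_penalty` and `acs` are illustrative.

```python
def concentration_penalty(action_counts):
    """Concentration of the action distribution.

    Returns 1/k for a uniform distribution over k actions, up to 1.0
    when all mass sits on one action, so subtracting it from the reward
    punishes skewed policies.
    """
    total = sum(action_counts.values())
    if total == 0:
        return 0.0
    return sum(n * n for n in action_counts.values()) / (total * total)

acs = {0: 100, 1: 10}
penalty = concentration_penalty(acs)  # (100^2 + 10^2) / 110^2 ≈ 0.835
```

The penalty could then be subtracted (scaled by some constant) from the per-episode reward.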

Answer 1: You need to design your reward signal to reflect, as closely as possible, what you want your agent to learn. You have not detailed how you reward your agent. Which agent do you use, and how is it rewarded?

Answer 2: To the best of my knowledge, there is only one “place” in RLlib that does something similar to what you are trying to accomplish: entropy regularization (see actor-critic algorithms and the `entropy_coeff` parameter in the RLlib algorithms). With it you can heavily penalize policies with narrow action distributions. But be aware that this hinders convergence towards the optimal policy in the end, since the agent always explores and never exploits. This is why you might want to employ an entropy coefficient schedule - also possible in RLlib.
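RLlib expresses such schedules as `[[timestep, value], ...]` pairs with linear interpolation in between. As a rough illustration of the idea (the helper below is my own sketch, not RLlib code):

```python
def entropy_coeff_at(schedule, t):
    """Piecewise-linear interpolation of a [[timestep, value], ...] schedule."""
    t0, v0 = schedule[0]
    if t <= t0:
        return v0
    for t1, v1 in schedule[1:]:
        if t <= t1:
            frac = (t - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
        t0, v0 = t1, v1
    return v0  # past the last breakpoint: hold the final value

# Explore hard early, anneal towards exploitation later.
schedule = [[0, 0.05], [1_000_000, 0.0]]
```

Halfway through the schedule, the coefficient has decayed to half its initial value, so the exploration pressure fades gradually instead of being switched off.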
Furthermore, I would not recommend scaling the rewards “by hand”, because it is less expressive than penalizing the action distribution in your loss function. Have a look at the RLlib implementations of loss functions; you can plug in your own if you want to run this experiment.
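To make the loss-function route concrete, here is the general shape of an entropy-regularized policy loss: the total loss subtracts an entropy bonus, so near-deterministic action distributions receive a smaller bonus and are effectively penalized. This is a pure-Python sketch of the principle, not RLlib’s actual loss code.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def loss_with_entropy_bonus(pg_loss, probs, entropy_coeff):
    """Policy-gradient loss minus an entropy bonus.

    A broad distribution (high entropy) lowers the total loss; a
    collapsed, single-action distribution (entropy 0) gets no bonus.
    """
    return pg_loss - entropy_coeff * entropy(probs)
```

In RLlib you would instead customize the algorithm’s loss, but the term being added is this one.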

Cheers

> You have not detailed how you reward your agent. Which agent do you use, and how is it rewarded?

The agent makes buy/sell decisions, and the reward is the profit/loss accrued during the step. This is the most natural way to specify the reward for an automated trading agent, although possibly not the optimal one for a reinforcement learning system… I’m trying various options.
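The per-step reward described above can be sketched as follows; the function name and the position encoding (+1 = long, 0 = flat/in cash) are my own assumptions for illustration.

```python
def step_reward(position, price_before, price_after):
    """Profit/loss accrued over one step for the position held.

    position: +1 = long (bought), 0 = flat (in cash).
    """
    return position * (price_after - price_before)

# Holding through a rising price earns the move; staying in cash earns nothing.
step_reward(1, 100.0, 101.5)  # 1.5
step_reward(0, 100.0, 101.5)  # 0.0
```

Note that under this reward, a policy that always stays in cash earns exactly 0, which matches the near-zero reward observed in the question.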

> `entropy_coeff`

Will look into this, thanks!