Dealing with very imbalanced rewards in most of state space

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

More a general RL question than an RLlib one, but I’m curious if anyone here has an idea: I’m dealing with a family of settings where in a large part of the state space, one action is optimal, and in a very small part of the state space another, and none of the algorithms I’ve tried pick up on that small part of the space.

This happens even in very small, simple test cases, e.g. a MultiDiscrete(3,8) state space and just two actions. In that setting, for 6 of the 8 possible values of the second component the agent should always perform action 1; for the other 2 values it should always perform action 2.
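To make that concrete, here is a minimal sketch of the kind of environment I mean (the class name, the choice of which two values are the "special" ones, and the 0/1 reward scheme are just illustrative, not my exact setup):

```python
import gymnasium as gym
from gymnasium import spaces


class ToyImbalancedEnv(gym.Env):
    """Toy setting: MultiDiscrete([3, 8]) observations, two actions.
    For 6 of the 8 values of the second component one action is correct;
    for the other 2 values the other action is correct."""

    def __init__(self):
        self.observation_space = spaces.MultiDiscrete([3, 8])
        self.action_space = spaces.Discrete(2)
        # The two "rare" values of the second component (arbitrary choice here).
        self.special_values = {6, 7}

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.obs = self.observation_space.sample()
        return self.obs, {}

    def step(self, action):
        # Action 1 is correct in the rare states, action 0 everywhere else.
        correct = 1 if self.obs[1] in self.special_values else 0
        reward = 1.0 if action == correct else 0.0
        self.obs = self.observation_space.sample()
        # One-step episodes keep the example minimal.
        return self.obs, reward, True, False, {}
```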

Now the problem is that with PG, PPO, and DQN, the agent never learns to act correctly in that small part of the state space. This happens even if I multiply out the state space into a single Discrete(24) space and use a linear policy. In that case, the bias of the single linear layer gets so many gradients pointing it toward action 1 that it quickly overpowers the weights for the individual states. If I use a model that's just a single linear layer without a bias, it works perfectly fine. But of course that only works if I multiply out the multi-discrete state space into one big discrete space, which doesn't allow any generalisation between different states and won't scale to bigger instances of the same setting.
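Here's a tiny PyTorch illustration of the with/without-bias comparison I mean (this is just a sketch of the idea, not my actual RLlib model):

```python
import torch
import torch.nn as nn

n_states, n_actions = 24, 2

# With a bias, the bias term receives gradients from every single state, so the
# majority action's logit drifts up everywhere and swamps the per-state weights.
biased_policy = nn.Linear(n_states, n_actions, bias=True)

# Without a bias, each one-hot state only updates its own column of weights,
# so the rare states keep their own independent logits.
unbiased_policy = nn.Linear(n_states, n_actions, bias=False)


def logits(policy: nn.Linear, state_index: int) -> torch.Tensor:
    """Compute the action logits for a single one-hot-encoded discrete state."""
    one_hot = torch.nn.functional.one_hot(
        torch.tensor(state_index), num_classes=n_states
    ).float()
    return policy(one_hot)
```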

Is this a known problem in RL and are there any established techniques for dealing with this kind of problem?

Thanks so much!


Consider using an advantage actor-critic method. These methods compare the value of an action in a given state against the value of that state itself (the advantage), which should keep the policy from being pushed toward whichever action dominates in high-value states and allow it to pick up on the nuances you've described here. A2C is my algorithm of choice. I've run experiments in scenarios similar to what you're describing, and with enough training steps I've seen my agents learn to handle this kind of nuance.
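To illustrate what the advantage baseline buys you, here's a rough sketch of a one-step advantage estimate (just an illustration of the idea, not RLlib's actual implementation, which uses more sophisticated estimators):

```python
import numpy as np


def one_step_advantages(rewards, values, dones, gamma=0.99):
    """One-step advantage estimates A(s, a) = r + gamma * V(s') - V(s).

    `values` holds V(s_t) for t = 0..T (one extra bootstrap value at the end);
    `rewards` and `dones` have length T. Subtracting V(s) removes the
    "this state is valuable no matter what I do" component, so high-value
    regions of the state space no longer drown out the per-action signal.
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)
    next_values = values[1:] * (1.0 - dones)  # zero out bootstrap at episode end
    return rewards + gamma * next_values - values[:-1]
```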

Oh, that’s a good point. I had started to think about manually normalising returns in different parts of the state space separately somehow, but advantage should do exactly that anyway. I’ll try - thank you!