Dealing with very imbalanced rewards in most of state space

mgerstgrasser · September 27, 2022, 3:30pm

How severe does this issue affect your experience of using Ray?

None: Just asking a question out of curiosity

More a general RL question than an RLlib one, but I’m curious if anyone here has an idea: I’m dealing with a family of settings where one action is optimal in a large part of the state space and another action is optimal in a very small part, and none of the algorithms I’ve tried pick up on that small part of the space.

This happens even in very small, simple test cases, e.g. a MultiDiscrete(3,8) state space and just two actions. In that setting, for 6 out of the 8 possible values of the second component, the agent should always perform action 1; for the other 2 values, it should always perform action 2.

Now the problem is that with PG, PPO, and DQN alike, the agent never learns to act correctly in that small part of the state space. This happens even if I multiply out the state space into a single Discrete(24) space and use a linear policy. In that case, the bias of the single linear layer receives so many gradients pointing it toward action 1 that they very quickly overpower the weights for the individual states. If I use a model that’s just a single linear layer without a bias, it works perfectly fine. But of course that only works if I multiply out the multi-discrete state space into a big discrete one, which doesn’t allow any generalisation between different states and won’t scale to bigger instances of the same setting.
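To make the bias effect concrete, here is a minimal NumPy sketch of the flattened Discrete(24) case. Several details are assumptions for illustration and not taken from my actual environment: the two "rare" values of the second component are taken to be 6 and 7, the task is treated as a one-step problem with reward 1 for the optimal action and 0 otherwise, states are visited uniformly, and the policy starts out uniform. Under those assumptions it computes the exact expected policy gradient at initialisation for a softmax policy with one-hot inputs.

```python
import numpy as np

N_FIRST, N_SECOND, N_ACTIONS = 3, 8, 2
RARE_SECOND_VALUES = {6, 7}  # hypothetical: which 2 of the 8 values favour action 2


def flatten(first, second):
    """Map a MultiDiscrete(3, 8) state to an index in Discrete(24)."""
    return first * N_SECOND + second


n_states = N_FIRST * N_SECOND

# Optimal action index per flattened state (0 = "action 1", 1 = "action 2").
optimal = np.zeros(n_states, dtype=int)
for first in range(N_FIRST):
    for second in range(N_SECOND):
        if second in RARE_SECOND_VALUES:
            optimal[flatten(first, second)] = 1

# Assumed reward: 1 for the optimal action, 0 otherwise.
reward = np.zeros((n_states, N_ACTIONS))
reward[np.arange(n_states), optimal] = 1.0

# Uniform initial softmax policy (all weights and biases zero).
pi = np.full((n_states, N_ACTIONS), 1.0 / N_ACTIONS)

# Exact gradient of the expected reward w.r.t. each state's logits,
# weighting every state by its visitation probability 1 / n_states.
expected_reward = (pi * reward).sum(axis=1, keepdims=True)
grad_logits = pi * (reward - expected_reward) / n_states

# With one-hot inputs, the weight for (state, action) sees only that state's
# gradient, while the shared bias sees the sum over all states.
grad_weights = grad_logits
grad_bias = grad_logits.sum(axis=0)

print("largest per-state weight gradient:", np.abs(grad_weights).max())
print("bias gradient:", grad_bias)
```

In this sketch the bias gradient toward action 1 is 12 times larger than the gradient on any single state's weight, simply because 18 of the 24 states agree on action 1 and all of them push the shared bias in the same direction.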

Is this a known problem in RL and are there any established techniques for dealing with this kind of problem?

Thanks so much!

rusu24edward · October 3, 2022, 3:29pm

Consider using advantage actor critic. These methods normalize the value of an action in a given state by the value of that state. This should help keep the policy from biasing actions in high-value states, allowing the policy to pick up on the nuances you’ve described here. A2C is my algorithm of choice. I’ve run experiments in scenarios similar to what you are describing, and with enough training steps I’ve seen my agents learn to deal with this kind of nuance.
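As a rough illustration of what the advantage does, here is a small NumPy sketch. The Q-values and the uniform policy are made-up numbers chosen for the example, not results from any experiment. It subtracts each state's value V(s) from the action values Q(s, a), so that the learning signal reflects which action is better within a state, not how valuable the state is overall.

```python
import numpy as np

# Hypothetical action values Q(s, a) for two states and two actions.
q_values = np.array([
    [10.0, 9.0],  # high-value state: action 0 is slightly better
    [0.0, 1.0],   # low-value state: action 1 is slightly better
])

# Assume a uniform policy, so V(s) is the mean of Q(s, a) over actions.
policy = np.full_like(q_values, 0.5)
state_values = (policy * q_values).sum(axis=1, keepdims=True)

# Advantage A(s, a) = Q(s, a) - V(s).
advantages = q_values - state_values

print("V(s):", state_values.ravel())
print("A(s, a):")
print(advantages)
```

With raw values, both actions in the high-value state look far better than anything in the low-value state. After subtracting V(s), the two states produce signals of the same magnitude, each pointing at its own better action.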

mgerstgrasser · October 3, 2022, 3:46pm

Oh, that’s a good point. I had started to think about manually normalising returns in different parts of the state space separately somehow, but advantage should do exactly that anyway. I’ll try - thank you!
