**How severe does this issue affect your experience of using Ray?**

- None: Just asking a question out of curiosity

More a general RL question than an RLlib one, but I’m curious if anyone here has an idea: I’m dealing with a family of settings where one action is optimal in a large part of the state space and a different action is optimal in a very small part, and none of the algorithms I’ve tried pick up on that small part of the space.

This happens even in very small, simple test cases, e.g. a MultiDiscrete([3, 8]) state space and just two actions. In that setting, for 6 of the 8 possible values of the second component the agent should always perform action 1; for the other 2 values it should always perform action 2.
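For concreteness, here is a minimal sketch of the setting (all names are mine, and it's plain Python rather than an actual gym/RLlib environment; actions are 0-indexed, so "action 1" above is index 0 and "action 2" is index 1):

```python
import random

# Toy version of the setting: state = (s0, s1) with s0 in {0, 1, 2} and
# s1 in {0, ..., 7}, i.e. a MultiDiscrete([3, 8]) observation space.
# For s1 in {0, ..., 5} the first action (index 0) is optimal; for the
# two rare values s1 in {6, 7} the second action (index 1) is optimal.

RARE_VALUES = {6, 7}  # the "small part" of the state space

def optimal_action(state):
    _, s1 = state
    return 1 if s1 in RARE_VALUES else 0

class ToyEnv:
    """Bandit-style environment: reward 1 for the optimal action, else 0."""

    def reset(self):
        self.state = (random.randrange(3), random.randrange(8))
        return self.state

    def step(self, action):
        # Reward depends only on the current state and the chosen action;
        # the next state is drawn uniformly at random.
        reward = 1.0 if action == optimal_action(self.state) else 0.0
        self.state = (random.randrange(3), random.randrange(8))
        return self.state, reward, False, {}
```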

Now the problem is that with PG, PPO, and DQN, the agent never learns to act correctly in that small part of the state space. This happens even if I flatten the state space into a single Discrete(24) space and use a linear policy. In that case, the bias of the single linear layer receives so many gradients pointing it toward action 1 that it quickly overpowers the weights for the individual states. If I use a model that’s just a single linear layer without a bias, it works perfectly fine. But of course that only works if I flatten the multi-discrete state space into one big discrete space, which doesn’t allow any generalisation between different states and won’t scale to bigger instances of the same setting.

Is this a known problem in RL, and are there any established techniques for dealing with it?

Thanks so much!