How severe does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
Question about how actions are sampled
In implementing an RL algorithm that uses gym’s flatten function, I ran into an error that I don’t get when using RLlib, so I would like to know how RLlib handles this problem. The flatten wrapper converts Discrete to Box as a one-hot encoding. Suppose the original space is Discrete(3), then:
0 maps to [1, 0, 0]
1 maps to [0, 1, 0]
2 maps to [0, 0, 1]
When we sample the action space for random actions, it samples the Box, which can produce any of the eight combinations of 0s and 1s in a three-element array, namely:
[0, 0, 0],
[0, 0, 1], *
[0, 1, 0], *
[0, 1, 1],
[1, 0, 0], *
[1, 0, 1],
[1, 1, 0],
[1, 1, 1]
Only the three that I've starred are usable in the strict sense of the mapping. The unflatten function for a Discrete space uses np.nonzero(x)[0][0], and here's a table of what the above arrays map to:
+--------------------+------------------+-----------------------------------------------+
| In Flattened Space | np.nonzero(x)[0] | np.nonzero(x)[0][0] (aka discrete equivalent) |
+--------------------+------------------+-----------------------------------------------+
| 0, 0, 0            | [] (empty)       | IndexError                                    |
| 0, 0, 1            | [2]              | 2                                             |
| 0, 1, 0            | [1]              | 1                                             |
| 0, 1, 1            | [1, 2]           | 1                                             |
| 1, 0, 0            | [0]              | 0                                             |
| 1, 0, 1            | [0, 2]           | 0                                             |
| 1, 1, 0            | [0, 1]           | 0                                             |
| 1, 1, 1            | [0, 1, 2]        | 0                                             |
+--------------------+------------------+-----------------------------------------------+
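For anyone who wants to reproduce the mapping, here is a minimal sketch that uses only NumPy. The helper name unflatten_discrete is hypothetical; it just applies the np.nonzero(x)[0][0] rule quoted above and is not gym's or RLlib's actual code.

```python
import itertools

import numpy as np


def unflatten_discrete(x):
    """Hypothetical stand-in for the rule above: index of the first nonzero entry."""
    return int(np.nonzero(x)[0][0])


rows = {}
for bits in itertools.product([0, 1], repeat=3):
    x = np.array(bits)
    try:
        rows[bits] = unflatten_discrete(x)
    except IndexError:
        # All zeros: np.nonzero returns an empty index array, so [0][0] fails.
        rows[bits] = None

for bits, action in rows.items():
    print(bits, "->", "IndexError" if action is None else action)
```

Any array with more than one nonzero entry silently collapses to the lowest index, which is why the low actions come up more often.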
Implications
Obviously, [0, 0, 0] will fail because it has no nonzero element.
Importantly, only one eighth of the random samples will map to 2. One fourth will map to 1, and one half will map to 0. The remaining one eighth is the failing [0, 0, 0] case. This has important implications for exploration, especially if action 2 is the “correct action” throughout much of the simulation. I’m very curious why I have not seen this come up before. This kind of skew in the random sampling can strongly affect how the algorithm explores and learns, and the problem gets worse as n grows in Discrete(n).
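To put numbers on the skew for a general Discrete(n), here is a small sketch that enumerates every 0/1 vector of length n and counts which action the first-nonzero rule picks. It assumes each entry is an independent fair coin flip, which is my reading of how the Box gets sampled; the function name first_nonzero_distribution is made up for illustration. Under that assumption, action k comes out with probability 2^-(k+1), and the all-zero failure case with probability 2^-n.

```python
from fractions import Fraction
from itertools import product


def first_nonzero_distribution(n):
    """Exact probability of each decoded action under the assumed sampling model."""
    counts = {k: 0 for k in range(n)}
    counts["error"] = 0
    for bits in product([0, 1], repeat=n):
        # The decoded action is the index of the first 1; all zeros cannot be decoded.
        key = bits.index(1) if 1 in bits else "error"
        counts[key] += 1
    total = 2 ** n
    return {k: Fraction(c, total) for k, c in counts.items()}


dist = first_nonzero_distribution(3)
for key, prob in dist.items():
    print(key, prob)
```

With n = 10 the last action would be drawn only once in 1024 samples, so exploration of the high-index actions is close to nonexistent.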
I never see this error when running with RLlib, so it seems to do something smarter than a raw sampling from the flattened action space. Can someone point me to more information on how RLlib randomly samples the spaces?