Sampling from flattened space

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

Question about how actions are sampled

In implementing an RL algorithm that uses gym’s flatten function, I ran into an error that I don’t get when using RLlib, so I would like to know how RLlib handles this problem. The flatten wrapper converts Discrete to Box as a one-hot encoding. Suppose the original space is Discrete(3), then:

0 maps to [1, 0, 0]
1 maps to [0, 1, 0]
2 maps to [0, 0, 1]

When we sample the action space for random actions, it samples the Box, which can produce any of the eight combinations of 0s and 1s in a three-element array, namely:

[0, 0, 0],
[0, 0, 1], *
[0, 1, 0], *
[0, 1, 1],
[1, 0, 0], *
[1, 0, 1],
[1, 1, 0],
[1, 1, 1]

Only the three that I’ve starred are usable in the strict sense of the mapping. The unflatten function for a Discrete space uses np.nonzero(x)[0][0], and here’s a table of what the above arrays map to:

+ ------------------ + ---------------- + --------------------------------------------- +
| In Flattened Space | np.nonzero(x)[0] | np.nonzero(x)[0][0] (aka discrete equivalent) |
+ ------------------ + ---------------- + --------------------------------------------- +
| 0, 0, 0            | Error            | Error                                         |
| 0, 0, 1            | [2]              | 2                                             |
| 0, 1, 0            | [1]              | 1                                             |
| 0, 1, 1            | [1, 2]           | 1                                             |
| 1, 0, 0            | [0]              | 0                                             |
| 1, 0, 1            | [0, 2]           | 0                                             |
| 1, 1, 0            | [0, 1]           | 0                                             |
| 1, 1, 1            | [0, 1, 2]        | 0                                             |
+ ------------------ + ---------------- + --------------------------------------------- +
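
The table can be reproduced with a few lines of NumPy. This is a minimal sketch that applies np.nonzero(x)[0][0] by hand, as described above; decode is a hypothetical helper written for illustration, not a gym function.

```python
import itertools
import numpy as np

def decode(x):
    """Mimic the decoding described above: index of the first nonzero entry."""
    indices = np.nonzero(x)[0]
    if indices.size == 0:
        return "Error"  # indices[0] would raise IndexError for the all-zero array
    return int(indices[0])

# Enumerate all 2**3 points of the flattened space in the same order as the table.
rows = {}
for bits in itertools.product([0, 1], repeat=3):
    rows[bits] = decode(np.array(bits))
    print(bits, "->", rows[bits])
```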

Implications

Obviously, [0, 0, 0] will fail because there is no nonzero.
Importantly, only one eighth of the random samples will map to 2. One fourth will map to 1, and one half will map to 0. This has important implications for exploration, especially if action 2 is the “correct action” throughout much of the simulation. I’m very curious why I have not seen this come up before. This kind of skew in the random sampling can strongly affect how the algorithm explores and learns, and the problem gets worse as n in Discrete(n) grows.
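
These fractions follow from counting. As a sketch, assuming each of the 2**n binary vectors is equally likely, the code below enumerates them and records the index of the first 1 (or "error" for the all-zero vector). decode_distribution is a hypothetical helper, not part of gym or RLlib.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def decode_distribution(n):
    """Exact distribution of the first-nonzero index over all 2**n binary vectors."""
    counts = Counter()
    for bits in product([0, 1], repeat=n):
        # First index holding a 1, or "error" when the vector is all zeros.
        first = next((i for i, b in enumerate(bits) if b), "error")
        counts[first] += 1
    total = 2 ** n
    return {k: Fraction(v, total) for k, v in counts.items()}

dist3 = decode_distribution(3)
print(dist3)      # fractions for actions 0, 1, 2 and the error case
dist5 = decode_distribution(5)
print(dist5[4])   # share of samples that decode to the last action of Discrete(5)
```

Action k is reached only when the first k entries are 0 and entry k is 1, so its share halves with each step: 1/2, 1/4, 1/8, and so on. The last action of Discrete(n) and the error case each get 1/2**n.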

I never see this error when running with RLlib, so it seems to do something smarter than a raw sampling from the flattened action space. Can someone point me to more information on how RLlib randomly samples the spaces?

Not a high priority question, but it would be nice if someone has some free time to get eyes on this.

“Importantly, only one eighth of the random samples will map to 2. One fourth will map to 1, and one half will map to 0.”

Where does the other eighth map to? The random samples don’t appear to be random. Can you set up a reproduction script?

“I never see this error when running with RLlib, so it seems to do something smarter than a raw sampling from the flattened action space.”

We sample from spaces in many places. I’m not sure which one you are referring to here.
In a one-hot encoding, many of the cases you list are not valid encodings though.
For example [0, 1, 1] is not one-hot, but “two hot”.

Cheers

Thanks for the response, @arturn

When a gym space is flattened, it is converted to a Box. The flattened space does not store information about what it was before being flattened; in particular, it doesn’t retain the fact that it is a one-hot encoding. Flattening Discrete(3) produces Box(0, 1, (3,), int). In Discrete(3), there are three possible points: 0, 1, or 2. On the other hand, Box(0, 1, (3,), int) has 8 possible points:

[0, 0, 0],
[0, 0, 1], *
[0, 1, 0], *
[0, 1, 1],
[1, 0, 0], *
[1, 0, 1],
[1, 1, 0],
[1, 1, 1]

Only the 3 that I’ve starred are valid. Thus, if random samples are generated from the flattened space instead of the original space, then there will be invalid points, as you said above. If you look in the table I created above, you can see how each of the points from the Box space will map to the Discrete space during the unflatten process. The distribution is skewed, and one of the points will fail completely.

All this to say, gym spaces do not guarantee that flatten -> sample -> unflatten yields a valid sample, or one with the same distribution as sampling the original space.

Here’s an example script:

from gym.spaces import Discrete
from gym.spaces.utils import flatten_space, unflatten

def to_string(array):
    # Turn a sampled array such as [0, 1, 1] into the dictionary key '011'.
    return ''.join(str(i) for i in array)

# Baseline: sample the original Discrete space directly.
discrete_space = Discrete(3)
discrete_sample_counter = {0: 0, 1: 0, 2: 0}
for _ in range(10000):
    sample = discrete_space.sample()
    discrete_sample_counter[sample] += 1

print(discrete_sample_counter)

# Flatten the space, sample the resulting Box, then unflatten each sample.
flattened_space = flatten_space(discrete_space)
flattened_sample_counter = {
    '000': 0,
    '001': 0,
    '010': 0,
    '011': 0,
    '100': 0,
    '101': 0,
    '110': 0,
    '111': 0
}
unflattened_sample_counter = {0: 0, 1: 0, 2: 0, 'error': 0}
for _ in range(10000):
    sample = flattened_space.sample()
    flattened_sample_counter[to_string(sample)] += 1

    try:
        unflattened_sample = unflatten(discrete_space, sample)
        unflattened_sample_counter[unflattened_sample] += 1
    except IndexError:
        # The all-zero sample has no nonzero entry to decode.
        unflattened_sample_counter['error'] += 1

print(flattened_sample_counter)
print(unflattened_sample_counter)

# Output from one run:
# discrete_sample_counter >> {0: 3462, 1: 3333, 2: 3205}
# ^ Notice the equal distribution among the three options; each is chosen about 1/3 of the time.
# flattened_sample_counter >> {'000': 1241, '001': 1266, '010': 1211, '011': 1245, '100': 1251, '101': 1279, '110': 1284, '111': 1223}
# ^ Notice the equal distribution among the eight options; each is chosen about 1/8 of the time.
# unflattened_sample_counter >> {0: 5037, 1: 2456, 2: 1266, 'error': 1241}
# ^ Notice the skewed distribution among the three options. As in the table above, 1/2 of the samples map to 0, 1/4 to 1, 1/8 to 2, and the other 1/8 are errors.

I’ve not seen the IndexError while working with RLlib, so I’m curious to know more about how RLlib generates samples. It doesn’t seem to be directly sampling from the flattened space. Does it sample the original space before the pre-process flattening? I’m concerned about the skewing problem too.

FYI, I’ve brought this up to gym, and you can see the discussion here.

The line that we use to one-hot encode observations reads gym.spaces.utils.flatten(self._obs_space, observation).astype(np.float32).
When sampling from Categorical (Discrete for that matter), you can simply take a random sample and use argmax.
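
One way to read the argmax suggestion, offered as a sketch and not as RLlib's actual code: draw one continuous score per action and take the index of the largest. With independent uniform scores, every index is equally likely by symmetry, and argmax always returns a valid index, so there is no error case. The names rng, scores, and actions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
n_actions = 3
num_samples = 10_000

# One independent uniform score per action, for every sample.
scores = rng.random((num_samples, n_actions))

# The index of the largest score is the sampled discrete action.
actions = np.argmax(scores, axis=1)

# Tally how often each action was chosen.
counts = np.bincount(actions, minlength=n_actions)
print(counts)
```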

Thanks for the response and for the links. Flattening the point is fairly straightforward, and what I’m bringing up relates to unflattening a sample. I dug around the code a bit, and it looks like the original structure of the space is maintained and utilized for unflattening, so that’s good.
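
For anyone landing here later, the safe ordering is to sample from the original space and flatten the sample afterwards. The sketch below illustrates that ordering without gym: one_hot and from_one_hot are hypothetical stand-ins for flattening and unflattening a Discrete(n) sample, not the library functions and not RLlib's code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
n = 3  # size of the original Discrete(n) space

def one_hot(action, n):
    """Hypothetical stand-in for flattening a Discrete(n) sample."""
    x = np.zeros(n, dtype=np.int64)
    x[action] = 1
    return x

def from_one_hot(x):
    """Hypothetical stand-in for unflattening: index of the first nonzero entry."""
    return int(np.nonzero(x)[0][0])

counts = np.zeros(n, dtype=np.int64)
for _ in range(10_000):
    action = int(rng.integers(n))        # sample the ORIGINAL space
    flat = one_hot(action, n)            # flatten the sample afterwards
    assert from_one_hot(flat) == action  # the round trip is exact
    counts[action] += 1

print(counts)
```

Because every flattened point comes from a valid action, the all-zero and multi-hot arrays never appear, and the counts stay close to uniform.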