I’m trying to design a game in which each agent’s action is an ordered sequence of 5 distinct numbers from 0 to 19 — the same number may not be chosen more than once.

Example actions from this space (written in list notation rather than as numpy arrays):

[4, 15, 3, 2, 11]
[0, 5, 2, 8, 15]

How would you use the spaces provided by RLlib/gym to best express this action space, so that an algorithm like PPO or IMPALA can learn effective behaviors as easily as possible?
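For reference, the closest built-in space seems to be `MultiDiscrete`, which does not by itself encode the no-repeat constraint — a minimal sketch of what I mean:

```python
from gym.spaces import MultiDiscrete

# 5 slots, each taking a value in 0..19. Note: this declaration
# does NOT enforce distinctness; a plain sample() may repeat values,
# so the constraint has to live in the action distribution.
space = MultiDiscrete([20] * 5)
sample = space.sample()  # shape (5,), values in [0, 20), repeats possible
```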

I think I would start by creating a new multi-discrete action distribution that captures the ideas from this paper:

Ancestral Gumbel-Top-k Sampling for Sampling Without Replacement

You can try other approaches too, but more generally I would try an approach that samples without replacement from a multi-discrete categorical distribution and adjusts the entropy, KL, and log_prob computations accordingly. If you implement a custom action distribution like that, you should be able to use A2C/PPO in RLlib without further modifications.
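A minimal numpy sketch of the two pieces such a distribution needs — Gumbel-top-k sampling (draw k distinct indices in one shot) and the matching ancestral log-probability (used for PPO's ratio and entropy terms). The function names are my own; a real RLlib `ActionDistribution` subclass would wrap these in torch/tf ops:

```python
import numpy as np

def gumbel_top_k(logits, k, rng=None):
    """Sample k DISTINCT indices from the categorical defined by `logits`
    via the Gumbel-top-k trick: perturb each logit with i.i.d. Gumbel(0, 1)
    noise and take the k largest. Equivalent to ordered sampling without
    replacement from softmax(logits)."""
    rng = np.random.default_rng() if rng is None else rng
    perturbed = logits + rng.gumbel(size=len(logits))
    return np.argsort(perturbed)[::-1][:k]  # indices, best first

def seq_log_prob(logits, action):
    """Log-probability of an ordered sample without replacement, factored
    ancestrally: p(a) = prod_i softmax(logits masked to unchosen)[a_i]."""
    logits = logits.astype(float).copy()
    lp = 0.0
    for a in action:
        z = logits - logits.max()          # stabilized log-softmax
        lp += z[a] - np.log(np.exp(z).sum())
        logits[a] = -np.inf                # remove a from later steps
    return lp

logits = np.zeros(20)                      # uniform over 0..19
action = gumbel_top_k(logits, 5)           # 5 distinct values in [0, 20)
lp = seq_log_prob(logits, action)
```

For the uniform case above, every ordered 5-sequence has probability 1/(20·19·18·17·16), which is a handy sanity check for the log_prob implementation.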