I’m trying to design a game in which each agent’s action is an ordered sequence of 5 distinct numbers from 0 to 19 — the same number may not be chosen more than once.

Example actions from this space (written in list notation rather than as numpy arrays):

[4, 15, 3, 2, 11]
[0, 5, 2, 8, 15]

How would you use the spaces provided by RLlib/gym to best express this action space, so that an algorithm like PPO or IMPALA can learn effective behaviors as easily as possible?
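For reference, the closest built-in space seems to be `MultiDiscrete`, which does not by itself encode the no-repeat constraint — a minimal sketch of what I mean:

```python
from gym.spaces import MultiDiscrete

# 5 slots, each taking a value in 0..19. Note: this declaration
# does NOT enforce distinctness; a plain sample() may repeat values,
# so the constraint has to live in the action distribution.
space = MultiDiscrete([20] * 5)
sample = space.sample()  # shape (5,), values in [0, 20), repeats possible
```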

I think I would start by creating a new multi-discrete action distribution that captures the ideas from this paper:

Ancestral Gumbel-Top-k Sampling for Sampling Without Replacement

You can try other approaches too, but more generally I would try an approach that samples without replacement from a multi-discrete categorical distribution and adjusts the entropy, KL, and log_prob computations accordingly. If you implement a custom action distribution like that, you should be able to use A2C/PPO in RLlib without further modifications.
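A minimal numpy sketch of the two pieces such a distribution needs — Gumbel-top-k sampling (draw k distinct indices in one shot) and the matching ancestral log-probability (used for PPO's ratio and entropy terms). The function names are my own; a real RLlib `ActionDistribution` subclass would wrap these in torch/tf ops:

```python
import numpy as np

def gumbel_top_k(logits, k, rng=None):
    """Sample k DISTINCT indices from the categorical defined by `logits`
    via the Gumbel-top-k trick: perturb each logit with i.i.d. Gumbel(0, 1)
    noise and take the k largest. Equivalent to ordered sampling without
    replacement from softmax(logits)."""
    rng = np.random.default_rng() if rng is None else rng
    perturbed = logits + rng.gumbel(size=len(logits))
    return np.argsort(perturbed)[::-1][:k]  # indices, best first

def seq_log_prob(logits, action):
    """Log-probability of an ordered sample without replacement, factored
    ancestrally: p(a) = prod_i softmax(logits masked to unchosen)[a_i]."""
    logits = logits.astype(float).copy()
    lp = 0.0
    for a in action:
        z = logits - logits.max()          # stabilized log-softmax
        lp += z[a] - np.log(np.exp(z).sum())
        logits[a] = -np.inf                # remove a from later steps
    return lp

logits = np.zeros(20)                      # uniform over 0..19
action = gumbel_top_k(logits, 5)           # 5 distinct values in [0, 20)
lp = seq_log_prob(logits, action)
```

For the uniform case above, every ordered 5-sequence has probability 1/(20·19·18·17·16), which is a handy sanity check for the log_prob implementation.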