Action space for choosing a sequence of items from a bigger sequence


I’m trying to design a game in which each agent’s action is choosing an ordered sequence of 5 distinct numbers from 0 to 19, i.e., without choosing the same number more than once.

Example actions from this space (written in list notation rather than as NumPy arrays):

[4, 15, 3, 2, 11]
[0, 5, 2, 8, 15]

How would you best express this action space using the spaces provided by RLlib/gym, so that an algorithm like PPO or IMPALA can learn effective behaviors as easily as possible?
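For scale: a flat MultiDiscrete([20] * 5) space can hold these vectors, but it can’t express the no-repeat constraint by itself, so most of its vectors are invalid. A quick sanity check on the sizes (just stdlib, no gym needed):

```python
from math import perm

# Ordered sequences of 5 distinct numbers drawn from 0..19:
# a MultiDiscrete([20] * 5) space contains 20**5 vectors, but
# only 20P5 = 20*19*18*17*16 of them satisfy the constraint.
print(perm(20, 5))   # 1860480 valid ordered sequences
print(20 ** 5)       # 3200000 vectors in the unconstrained space
```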

Thanks for your help,
Ram Rachum.

Hi @cool-RR,

Is the agent choosing the 5 numbers in the same step or one at a time over 5 steps?

The agent is choosing the 5 numbers in the same step.

@mannyv Any idea how to tackle this?

@kourosh Could you help here please?

Hi @cool-RR,

I think I would start by creating a new multi-discrete action distribution that captures the ideas from this paper:

Ancestral Gumbel-Top-k Sampling for Sampling Without Replacement

You can try other approaches too, but more generally I would use sampling without replacement from a multi-discrete categorical distribution, adjusting the entropy, KL, and log_prob appropriately. If you implement something like that, you should be able to use A2C/PPO in RLlib without modifications.
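As a rough sketch of the core idea from that paper (the function names and this NumPy implementation are my own, not RLlib API): perturbing each logit with independent Gumbel(0, 1) noise and taking the top-k perturbed indices draws an ordered k-subset without replacement, and the matching Plackett-Luce log-probability is what a PPO-style loss would need.

```python
import numpy as np

def gumbel_top_k(logits, k, rng=None):
    """Sample an ordered k-subset without replacement.

    Adding Gumbel(0, 1) noise to each logit and taking the top-k
    perturbed indices is equivalent to sampling sequentially from the
    softmax distribution without replacement (Plackett-Luce).
    """
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    # Descending sort of the perturbed logits; the order of the top-k
    # indices is the order in which the items were "drawn".
    return np.argsort(-(logits + gumbel))[:k]

def ordered_log_prob(logits, action):
    """Plackett-Luce log-probability of an ordered sample:
    sum over steps of (chosen logit - logsumexp of remaining logits)."""
    remaining = np.asarray(logits, dtype=float).copy()
    lp = 0.0
    for a in action:
        m = remaining.max()
        lp += remaining[a] - (m + np.log(np.exp(remaining - m).sum()))
        remaining[a] = -np.inf  # remove the chosen item
    return lp

logits = np.zeros(20)              # uniform preference over 0..19
action = gumbel_top_k(logits, k=5) # e.g. five distinct numbers
```

In RLlib this would live in a custom action distribution class, with gumbel_top_k as its sample() and ordered_log_prob as its logp(); the entropy has no simple closed form, so in practice you’d estimate it from samples.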


Thank you! I’ll give it a try.