Top-K action sampling

I trained a policy with PPO to pick a node from a graph (action distribution of size N).

Now I want to modify it to pick the top-K nodes from the graph. Ideally, I could keep the same action distribution rather than training a new policy with an action distribution of shape (N, K).

I don’t need conditional probabilities as in an autoregressive policy; I just need to avoid picking the same action twice.
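
To make the goal concrete, here is a minimal sketch of what I have in mind, assuming the policy exposes its N unnormalized logits as an array. The function name `sample_top_k` and its parameters are hypothetical, not part of any library. It uses the Gumbel-top-k trick (add independent Gumbel noise to the logits and keep the K largest), which draws K distinct actions from the single size-N distribution without replacement. With `deterministic=True` it simply takes the K highest logits.

```python
import numpy as np

def sample_top_k(logits, k, rng=None, deterministic=False):
    """Pick k distinct action indices from one categorical over N actions.

    Hypothetical helper: `logits` is assumed to be the policy's
    unnormalized log-probabilities, shape (N,) or (batch, N).
    """
    logits = np.asarray(logits, dtype=np.float64)
    n = logits.shape[-1]
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= N")

    if deterministic:
        # Greedy: rank the actions by their logits alone.
        scores = logits
    else:
        # Gumbel-top-k: perturb each logit with i.i.d. Gumbel(0, 1) noise.
        # Taking the k largest perturbed scores is equivalent to sampling
        # k times without replacement from softmax(logits).
        rng = np.random.default_rng() if rng is None else rng
        scores = logits + rng.gumbel(size=logits.shape)

    # Indices of the k largest scores, highest first; distinct by construction.
    return np.argsort(-scores, axis=-1)[..., :k]


# Usage: four nodes, pick two.
logits = np.array([0.1, 2.0, -1.0, 1.5])
greedy = sample_top_k(logits, k=2, deterministic=True)
sampled = sample_top_k(logits, k=2, rng=np.random.default_rng(0))
```

This only covers action selection. If the K picks were fed back into PPO updates, the log-probability of the selected set would have to be handled separately, which is outside what this sketch does.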