All or nothing (Explore or sample) actions - correct for each step?

How can I tell which of the actions were randomly chosen (stochastic) and which ones were deterministic? I am digging into StochasticSampling as a choice for the multi-agent PPO algorithm I am currently running, and I had a question about this snippet:

    stochastic_actions = tf.cond(
        pred=tf.convert_to_tensor(ts < self.random_timesteps),
        true_fn=lambda: (
            self.random_exploration.get_tf_exploration_action_op(
                action_dist, explore=True
            )[0]
        ),
        false_fn=lambda: action_dist.sample(),
    )

Here, if the predicate is true, then all agents in that step receive a completely random (exploratory) action instead of a sample from the action distribution? And if it is false, then every agent acts based on a sample from the action distribution.

But wouldn’t it be better if each agent individually had a chance to either take an exploratory action or sample from the distribution? Sorry, my understanding may not be accurate. But if it is, would it be wise to modify this so that every agent has a chance to either explore or sample from the distribution in that step, along the lines of the sketch below? How would I go about this?
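This is only a rough sketch of what I have in mind, reusing the names from the snippet above; explore_prob is a hypothetical per-agent exploration probability that does not exist in StochasticSampling today, and I am assuming a simple [batch, action_dim] action tensor:

    import tensorflow as tf

    # Compute both candidate actions, then pick per batch row (per agent).
    random_actions = self.random_exploration.get_tf_exploration_action_op(
        action_dist, explore=True
    )[0]
    sampled_actions = action_dist.sample()

    # One Bernoulli draw per batch row; explore_prob is a made-up parameter.
    batch_size = tf.shape(sampled_actions)[0]
    explore_mask = tf.random.uniform((batch_size,)) < explore_prob

    # Assumes a flat [batch, action_dim] action tensor; nested action spaces
    # would need tf.nest.map_structure instead of a single tf.where.
    stochastic_actions = tf.where(
        explore_mask[:, None], random_actions, sampled_actions
    )

Is something like this a reasonable direction, or does it break assumptions elsewhere in the exploration/policy code?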

Please advise. Thanks.

Hi @hridayns,

If you look at the arguments to __init__, there is an argument called random_timesteps. It is a configuration parameter you can set in the exploration config, and it indicates for how many timesteps you want the policy to act completely randomly at the beginning of training. For PPO the default is 0, which means you will never hit the true branch of that conditional.
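For reference, this is roughly how you would set it in the exploration config (I'm showing the config-dict style here; the exact way you pass this may differ a bit depending on your RLlib version):

    config = {
        # ... rest of your multi-agent PPO config ...
        "exploration_config": {
            # StochasticSampling is PPO's default exploration type.
            "type": "StochasticSampling",
            # Number of initial timesteps during which actions come from the
            # random exploration instead of the action distribution.
            # Default is 0, so the true_fn branch above is never taken.
            "random_timesteps": 0,
        },
    }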

Yes, I understand that now. So it takes a sample from the action distribution at every step. But would it be wise to modify it such that at every step, there is a 50% chance of sampling from the action distribution and a 50% chance of performing a completely random action, along the lines of the sketch below?
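Just to make the idea concrete, I imagine replacing the timestep comparison with a per-step coin flip; this is only a sketch reusing the names from the snippet above, and the 0.5 probability is arbitrary:

    import tensorflow as tf

    # Hypothetical variant: instead of comparing ts against random_timesteps,
    # flip a coin each step, so the whole batch is either fully random or
    # fully sampled from the distribution for that step.
    explore_this_step = tf.random.uniform(()) < 0.5

    stochastic_actions = tf.cond(
        pred=explore_this_step,
        true_fn=lambda: self.random_exploration.get_tf_exploration_action_op(
            action_dist, explore=True
        )[0],
        false_fn=lambda: action_dist.sample(),
    )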