I am trying to understand the SoftQ algorithm, so I ran it with the tuned params provided for CartPole and stepped through it in a debugger to follow the object flow. I'm confused by where I end up: we take the argmax() of the distribution instead of sample().
I'm trying to figure out whether I'm doing something wrong or whether my understanding of SoftQ is off. I thought that in SoftQ we would sample from the current action distribution?
In soft_q.py, get_exploration_action() creates the distribution, applies the temperature, and then delegates to StochasticSampling to perform the actual sampling.
But in StochasticSampling's _get_torch_exploration_action(), we end up taking the argmax of the distribution, via action = action_dist.deterministic_sample().
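To make sure I'm describing the flow correctly, here is a toy paraphrase of what I think is happening (this is my own self-contained sketch, not the actual RLlib source; the function name and the plain-Python softmax are mine):

```python
import math
import random

def soft_q_action_sketch(q_values, temperature, explore, rng=random):
    """Toy paraphrase of the flow I'm seeing: SoftQ builds a softmax
    distribution over q_values / temperature, then StochasticSampling
    either samples it (explore=True) or takes the argmax (explore=False,
    i.e. action_dist.deterministic_sample())."""
    # Softmax over temperature-scaled Q-values (numerically stabilized).
    scaled = [q / temperature for q in q_values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    if explore:
        # What I expected SoftQ to do: sample from the distribution.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # What I actually end up in: the deterministic argmax branch.
    return max(range(len(probs)), key=lambda i: probs[i])
```

With explore=False this always returns the greedy action, which matches what I observe in the debugger.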
Also, the explore flag inside StochasticSampling is always False, even when I set it to True in the config, so we always end up at action = action_dist.deterministic_sample().
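For reference, this is roughly the relevant part of my config (the temperature value here is illustrative, not the tuned CartPole value):

```python
# Roughly the relevant part of my config (values illustrative):
config = {
    "explore": True,  # set explicitly, but explore still arrives as False
    "exploration_config": {
        "type": "SoftQ",
        "temperature": 1.0,
    },
}
```

Is there something else in the config that forces explore to False (e.g. an evaluation setting), or is this the intended behavior for SoftQ?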