Hi there
I am trying to understand the SoftQ algorithm so I tested it using the tuned params provided for Cartpole and debugged to understand the object flow. I’m a bit confused by where I end up: we take the argmax() of the distribution instead of sample().
I’m trying to figure out if I am doing something wrong or if my understanding of SoftQ is dodgy. I thought in SoftQ we would sample from the current action distribution?
More details:
Softq config used: https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dqn/cartpole-dqn-softq.yaml
Debug flow:
In soft_q.py method get_exploration_action() creates the distribution applying the temperature and then passes to StochasticSampling to perform the actual sampling.
In StochasticSampling’s _get_torch_exploration_action() we end up taking the argmax of the distribution (in action=action_dist.deterministic_sample()).
class TorchCategorical:
def deterministic_sample(self)->TensorType:
self.last_sample=self.dist.probs.argmax(dim=1)
Return self.last_sample
Also the explore flag in StochasticSampling is always False, even if I set it in config to be True, so we always end up at action=action_dist.deterministic_sample()
Thanks,
Perusha