SoftQ takes argmax?

Hi there

I am trying to understand the SoftQ algorithm, so I ran it with the tuned params provided for CartPole and stepped through it in a debugger to follow the object flow. I’m a bit confused by where I end up: we take the argmax() of the distribution instead of calling sample().

I’m trying to figure out if I am doing something wrong or if my understanding of SoftQ is dodgy. I thought in SoftQ we would sample from the current action distribution?

More details:

Softq config used: https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dqn/cartpole-dqn-softq.yaml

Debug flow:
In soft_q.py, the method get_exploration_action() creates the distribution, applying the temperature, and then passes it to StochasticSampling to perform the actual sampling.

In StochasticSampling’s _get_torch_exploration_action() we end up taking the argmax of the distribution (in action=action_dist.deterministic_sample()).

class TorchCategorical:
    def deterministic_sample(self) -> TensorType:
        self.last_sample = self.dist.probs.argmax(dim=1)
        return self.last_sample

Also, the explore flag in StochasticSampling is always False, even if I set it to True in the config, so we always end up at action=action_dist.deterministic_sample().
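To make the distinction in question concrete, here is a minimal sketch in plain Python of the two ways to pick an action from a temperature-scaled distribution. It is not RLlib's code: the helper names softmax_with_temperature and stochastic_sample, the Q-values, and the temperature are all made up for illustration; only the argmax-versus-sample contrast comes from the post above.

```python
import math
import random


def softmax_with_temperature(q_values, temperature=1.0):
    """Turn Q-values into action probabilities; a lower temperature is greedier."""
    scaled = [q / temperature for q in q_values]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_sample(probs):
    """Greedy choice: index of the most probable action (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def stochastic_sample(probs, rng):
    """Draw an action index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


q_values = [1.0, 1.5]  # hypothetical Q-values for two actions
probs = softmax_with_temperature(q_values, temperature=1.0)
rng = random.Random(0)

greedy = [deterministic_sample(probs) for _ in range(1000)]
sampled = [stochastic_sample(probs, rng) for _ in range(1000)]

print("probs:", probs)
print("greedy picks of action 0:", greedy.count(0))
print("sampled picks of action 0:", sampled.count(0))
```

With argmax, the less likely action is never chosen; with sampling, it is chosen roughly in proportion to its probability, which is the behaviour I expected from SoftQ.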

Thanks,
Perusha

Hi @Perusha,

Welcome to the forums. Which version of Ray are you using?
I am not seeing the same results as you. What I see is that the configuration you posted uses deterministic actions once, during initialization with a dummy batch of fake data, and then always uses stochastic actions when training with real data.
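As a rough illustration of that pattern, here is a simplified stand-in written in plain Python. It is not RLlib's implementation: the function get_exploration_action below, the probabilities, and the assumption that the initialization pass is the one call made with explore=False are all hypothetical, chosen only to show how a single deterministic call followed by stochastic ones could be traced.

```python
import random
from collections import Counter


def get_exploration_action(probs, explore, rng):
    """Simplified stand-in for the explore branch; not RLlib's actual code."""
    if explore:
        # Exploring: draw an action in proportion to its probability.
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return action, "stochastic"
    # Not exploring: take the most probable action (argmax).
    action = max(range(len(probs)), key=lambda i: probs[i])
    return action, "deterministic"


rng = random.Random(0)
probs = [0.4, 0.6]  # hypothetical action probabilities
modes = Counter()

# One hypothetical initialization call with explore=False (the dummy batch) ...
_, mode = get_exploration_action(probs, explore=False, rng=rng)
modes[mode] += 1

# ... followed by training-time calls with explore=True.
for _ in range(100):
    _, mode = get_exploration_action(probs, explore=True, rng=rng)
    modes[mode] += 1

print(dict(modes))
```

If a debugger breakpoint happens to land on that first call, it looks as though the deterministic branch is always taken, even though every later call samples.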

I made a Colab notebook that prints which type of action is being used, for you to explore here: Google Colab

Feel free to reach out with more questions.


Hi @mannyv
Thanks for the welcome and for responding!

I just installed Ray and RLlib last week, so I am pretty new to it, but I should have the latest versions.
You’re right… I am baffled by the first few deterministic runs but it seems to be stochastic after that!!

During the debug I landed in the deterministic section and then just assumed it was always going to be deterministic. Thanks for checking and for the colab! At least I know it’s working the way I want now :grinning:

Thanks for taking the time to take a look - much appreciated!

Best,
Perusha