Malformed "reparameterization trick" in squashed Gaussian

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I’ve been doing a lot of debugging with SAC recently, and one thorn in my side is how RLlib applies the “reparameterization trick” to the squashed Gaussian.

Everywhere I see the trick mentioned, it is defined as:

$a = \tanh ({\mu_\theta + \sigma_\theta\odot\xi})$ where $\xi\thicksim N(0,I)$, then mapping to $[low,high]$

However, it seems like RLlib just does:

$a = \tanh ({a_n})$ where $a_n\thicksim N(\mu_\theta,\operatorname{diag}(\sigma_\theta^2))$, then mapping to $[low,high]$

Can anyone elucidate how these two are functionally the same? What’s interesting is that even though RLlib is ostensibly not using the “reparameterization trick”, SAC still seems to work with automatic differentiation.
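To make the comparison concrete, here is a minimal sketch in plain Python (standard library only, not RLlib code). The function names and the values of `mu`, `sigma`, and `xi` are made up for illustration. It checks two things for a single action dimension: that both formulations sample the same distribution, and that with the noise held fixed the explicit form has well-defined gradients with respect to the mean and standard deviation.

```python
import math
import random


def sample_explicit(mu, sigma, xi):
    """Textbook form: a = tanh(mu + sigma * xi), with xi drawn from N(0, 1)."""
    return math.tanh(mu + sigma * xi)


def sample_then_squash(mu, sigma, rng):
    """Direct form: draw a_n from N(mu, sigma^2), then a = tanh(a_n)."""
    return math.tanh(rng.gauss(mu, sigma))


def pathwise_grads(mu, sigma, xi):
    """Analytic gradients of the explicit form with xi held fixed."""
    a = sample_explicit(mu, sigma, xi)
    dtanh = 1.0 - a * a  # derivative of tanh at the pre-squash value
    return dtanh, dtanh * xi  # (da/dmu, da/dsigma)


mu, sigma = 0.3, 0.8  # hypothetical policy outputs for one action dimension

# 1) Both forms sample the same distribution: compare Monte Carlo means.
rng = random.Random(0)
n = 100_000
mean_explicit = sum(
    sample_explicit(mu, sigma, rng.gauss(0.0, 1.0)) for _ in range(n)
) / n
mean_direct = sum(sample_then_squash(mu, sigma, rng) for _ in range(n)) / n
print("mean (explicit):", mean_explicit)
print("mean (direct):  ", mean_direct)

# 2) With xi fixed, the explicit form is differentiable in mu and sigma.
#    Compare the analytic gradients against central finite differences.
xi, eps = 0.5, 1e-6
fd_mu = (
    sample_explicit(mu + eps, sigma, xi) - sample_explicit(mu - eps, sigma, xi)
) / (2 * eps)
fd_sigma = (
    sample_explicit(mu, sigma + eps, xi) - sample_explicit(mu, sigma - eps, xi)
) / (2 * eps)
g_mu, g_sigma = pathwise_grads(mu, sigma, xi)
print("da/dmu    analytic vs finite difference:", g_mu, fd_mu)
print("da/dsigma analytic vs finite difference:", g_sigma, fd_sigma)
```

So the two definitions describe the same random variable. What matters for the SAC loss is whether the framework's sampler is itself implemented as a differentiable function of the mean and standard deviation. In PyTorch, for example, `torch.distributions.Normal.rsample()` is reparameterized while `Normal.sample()` is not, so it is worth checking which sampling call the squashed Gaussian distribution actually uses.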

Hi @grrsausage, can you file a GitHub issue for this, please?

It would be great if you could provide a script that demonstrates that the SAC loss we provide is “wrong”. Call our loss and the “correct” implementation with the same inputs and show that the results differ.

This might need some discussion.

So sorry for the delay, @arturn! I will open an issue now.