SAC inference action distribution very different from training

Hi all,

I’ve trained a SAC agent on a custom environment with 8 actions. During training I was logging the distribution of actions taken by the agent: the probability mass was spread roughly uniformly over 4 of the actions, while the remaining actions were chosen a negligible number of times (less than 5% of the time all together).

To my surprise, during post-training evaluation using compute_action the model exclusively chooses a single action.

I’ve looked at the action distribution returned by compute_action(..., full_fetch=True) and all the probability mass is on a single action.
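For reference, this is roughly how I am inspecting it (just a sketch; the exact extra-fetch keys such as action_dist_inputs, and whether the trainer exposes compute_action or compute_single_action, depend on the RLlib version, and eval_env / trainer are my own objects named here for illustration):

import numpy as np

obs = eval_env.reset()
action, state_out, extra = trainer.compute_action(obs, full_fetch=True)
logits = extra["action_dist_inputs"]   # distribution inputs for the 8 discrete actions
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax -> action probabilities
print(action, probs)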

Do you have any idea why that might be and what might be some diagnostics to check?

I’ve already checked the following and the problem still persists:

  • the observations are in exactly the same format and scale as during training
  • the data used in evaluation is the same as during training

Hi @bmanczak ,

Is there any chance that you are passing RLlib a config that includes the following?

"evaluation_config": {
        "explore": False,
    }

As you probably know, explore: False makes the policy act deterministically during evaluation; that is typical for Q-learning but usually not what you want for SAC.
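For SAC you would normally leave stochastic sampling on at evaluation time, i.e. something along these lines (just a sketch, using the same key as in the snippet above):

"evaluation_config": {
    "explore": True,  # sample from the learned action distribution instead of taking its deterministic mode
}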

Hi @arturn, thanks for your suggestion.

I am indeed using that config, but I’ve since found an explanation for the strange behaviour.

As it turned out, the logged action distribution was misleading: (1) it was skewed by the samples coming from the prioritised replay buffer, and (2) I was only logging actions that changed some part of the state in my environment, so again it did not reflect the real picture.

However, the fact that that part of the state did not change was strange to me, and only then did I realise that in the RLlib implementation of SAC the (natural) logarithm of the entropy coefficient is used instead of the coefficient itself, as done in the paper. With the default initial_alpha of 1, the effective coefficient is therefore log(1) = 0, i.e. no entropy bonus at all.
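To make that concrete, here is a tiny illustration (plain Python, not RLlib source) of what this does to the effective entropy weight:

import math

# If the loss uses log(initial_alpha) as the entropy coefficient,
# the default initial_alpha = 1.0 yields exactly 0, i.e. no entropy bonus.
print(math.log(1.0))    # 0.0
print(math.log(1.105))  # ~0.0998, roughly the usual 0.1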

So for anyone whose SAC agent is stuck on a small subset of actions: try setting the initial_alpha parameter higher (e.g. 1.105, which corresponds to $\log_e(1.105) \approx 0.1$) and tuning the entropy_learning_rate. A sketch of the relevant config follows below.
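For reference, the relevant part of my config looked roughly like this (a sketch; initial_alpha and optimization/entropy_learning_rate are the SAC config keys in the RLlib version I was on, and the learning-rate values are just placeholders to tune):

config = {
    "initial_alpha": 1.105,  # log(1.105) ≈ 0.1 effective entropy coefficient
    "optimization": {
        "actor_learning_rate": 3e-4,
        "critic_learning_rate": 3e-4,
        "entropy_learning_rate": 3e-4,  # worth tuning as well
    },
    # ... rest of the SAC config unchanged
}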

This issue can be closed.
