I’ve trained a SAC agent on a custom environment with 8 discrete actions. During training I logged the distribution of actions taken by the agent: the probability mass was spread roughly uniformly over 4 of the actions, while the remaining actions were chosen a negligible number of times (less than 5% of the time altogether).
To my surprise, during post-training evaluation with `compute_action` the model exclusively chooses a single action. I’ve looked at the action distribution returned by `compute_action(..., full_fetch=True)`, and all of the probability mass sits on that one action.
Do you have any idea why that might be and what might be some diagnostics to check?
I’ve already tried the following and the problem still persists:
- the observations are in exactly the same format and scale as during training
- the data used in evaluation is the same as during training
Hi @bmanczak ,
Is there any chance that you are passing RLlib a config that includes the following?
As you probably know, this should not be the case for SAC but is typical for Q learning.
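(The config snippet referenced above didn’t survive in this copy of the thread. As a guess at what was meant, here is a sketch of the usual culprit: an epsilon-greedy exploration config, which is standard for DQN-style Q-learning but would interfere with SAC, whose exploration should come from its own stochastic policy. The exact keys and values below are illustrative assumptions, not a quote of the original post.)

```python
# Hypothetical config fragment -- epsilon-greedy exploration, typical for
# Q-learning (DQN) but NOT appropriate for SAC:
q_learning_style_config = {
    "explore": True,
    "exploration_config": {
        "type": "EpsilonGreedy",      # anneal random-action probability
        "initial_epsilon": 1.0,
        "final_epsilon": 0.02,
        "epsilon_timesteps": 10_000,
    },
}

# For SAC, exploration should instead be driven by sampling from the
# learned stochastic policy:
sac_style_exploration = {"type": "StochasticSampling"}
```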
Hi @arturn, thanks for your suggestion.
I am using that config, but I’ve since found an explanation for the strange behaviour.
As it turned out, the logged action distribution was misleading for two reasons: (1) it was skewed by the samples coming from a prioritised replay buffer, and (2) I was only logging actions that changed some part of the state in my environment, which again did not reflect the real picture.
However, the fact that that part of the state never changed still seemed strange to me, and only then did I realise that the RLlib implementation of SAC uses the (natural) logarithm of the entropy coefficient where the paper uses the coefficient itself. With the default initial alpha of 1, log(1) = 0, so the effective entropy coefficient is 0 and the entropy bonus vanishes entirely.
So for anyone whose SAC policy is stuck on a small subset of actions: try setting the `initial_alpha` parameter higher (e.g. 1.105, which corresponds to $\log_e(1.105) \approx 0.1$) and tuning it from there.
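As a quick sanity check on the suggested value (a sketch, assuming the `initial_alpha` key sits at the top level of the SAC config as in older RLlib versions):

```python
import math

# ln(1.105) ~= 0.0999, so the effective (log-space) entropy coefficient
# starts out small but non-zero instead of exactly 0.
initial_alpha = 1.105
assert abs(math.log(initial_alpha) - 0.1) < 1e-3

sac_config = {
    "initial_alpha": initial_alpha,  # instead of the default 1.0
}
```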
This issue can be closed.