Mixing simulation and offline data with SAC

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello Ray RL community,

I am using a custom environment in a SACTrainer to teach a car model how to drive on a straight line at its maximum speed.

I am trying to ‘boost’ the learning by collecting experiences from the environment driven by a more traditional controller, and saving the data as suggested in:
Working With Offline Data — Ray 1.13.0

Once I point my SACTrainer to the offline-generated data by setting the config as suggested in the link above:

```
"input": {
    "path_to_my_offline_generated_data": 0.1,
    "sampler": 0.9,
},
```

and start the training, I get an error stating that the SampleBatch generated by my external simulation is missing the field "action_dist_inputs", which I believe is required by SACTrainer.
I worked around this error by setting this field to a list of zeros, but I assume this is conceptually wrong.

I tried to search online for the meaning of action_dist_inputs, but I have had no luck so far.
It would be helpful to understand what this variable means, so I can set it to its correct value when generating offline data.


Hi @Manna ,

Sorry for getting to this so late. action_dist_inputs are the direct outputs of your ANN.
You will usually have an action distribution class determined by ModelCatalog.get_action_dist(). This class takes the action_dist_inputs and produces the actual actions taken by the agent. Whether your offline data contains this field depends largely on how you collect the data.
For training, this should be fine, because SAC’s loss function does not take ACTION_DIST_INPUTS from the batch but recomputes them at learning time.
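To make the last point concrete, here is a minimal NumPy sketch of a SAC-style actor loss. It is illustrative only (the function and argument names are made up, not RLlib's actual code), but it shows why the stored column never matters: the distribution inputs are recomputed by a fresh forward pass on the batch observations.

```python
import numpy as np

def sac_actor_loss_sketch(policy_net, q_net, obs, alpha, rng):
    """Illustrative only: SAC recomputes the action-distribution inputs
    from the observations at learning time, so any 'action_dist_inputs'
    column stored in the offline batch is ignored by this loss."""
    dist_inputs = policy_net(obs)                  # fresh forward pass
    mean, log_std = np.split(dist_inputs, 2, axis=-1)
    std = np.exp(log_std)
    eps = rng.standard_normal(mean.shape)
    raw = mean + std * eps                         # reparameterized sample
    action = np.tanh(raw)                          # SquashedGaussian squash
    # Gaussian log-prob plus the tanh change-of-variables correction.
    log_prob = (-0.5 * (eps ** 2 + 2 * log_std + np.log(2 * np.pi))
                - np.log(1.0 - action ** 2 + 1e-6)).sum(axis=-1)
    # Maximize Q while keeping entropy high (alpha-weighted).
    return np.mean(alpha * log_prob - q_net(obs, action))
```

Whatever `action_dist_inputs` the offline writer put into the batch never appears here, which is why zero-filling the column is harmless for SAC training.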



Hey @Manna , thanks for raising this.

Yeah, @arturn described it correctly. To give you an example:
For an action space of e.g. gym.spaces.Box(-1.0, 1.0, (4, )), the space of action_dist_inputs would be: gym.spaces.Box(-1.0, 1.0, (8, )). Yes: 8 :slight_smile:
Because SAC uses the SquashedGaussian distribution, it expects 4 values for the means (one per action dim) and 4 values for log(std) (again one per action dim).

And your ANN in this case would have 8 linear output nodes (RLlib builds this for you).
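The mapping from 8 raw outputs to a 4-dim action can be sketched like this (the helper name is made up for illustration; it mimics a SquashedGaussian sample but is not RLlib's implementation):

```python
import numpy as np

def squashed_gaussian_action(action_dist_inputs, rng=None):
    """Turn 2*N raw network outputs into an N-dim action in [-1, 1].

    The first half of the inputs are means, the second half log-stds;
    a Gaussian sample is squashed with tanh (SquashedGaussian-style).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    inputs = np.asarray(action_dist_inputs, dtype=np.float64)
    mean, log_std = np.split(inputs, 2)
    std = np.exp(log_std)
    raw = rng.normal(mean, std)   # unsquashed Gaussian sample
    return np.tanh(raw)           # squash into the Box bounds

# An 8-dim action_dist_inputs vector yields a 4-dim action in (-1, 1).
action = squashed_gaussian_action(np.zeros(8))
```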

1 Like

Thank you very much for your answer @arturn.
My offline data is generated by stepping the environment and using a proportional controller to generate the control inputs.
Does this mean that, when I start training my SAC model, it does not matter what values I write into action_dist_inputs during offline data generation, i.e. could I just set it to a list of zeros?

Thank you!

Yes. These values will not be used by SAC when learning on the batch.
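For reference, a plain-dict sketch of the batch columns this implies when writing offline data (observation dimension and batch size are assumed for illustration; real RLlib offline data is written via SampleBatch/JsonWriter, this just shows the shapes):

```python
import numpy as np

ACTION_DIM = 4   # assumed 4-dim continuous action space
OBS_DIM = 10     # assumed observation dimension
BATCH_SIZE = 32  # assumed number of timesteps in the batch

# Zero-filled "action_dist_inputs" is fine here, since SAC recomputes
# the distribution inputs at learning time and ignores this column.
batch = {
    "obs": np.zeros((BATCH_SIZE, OBS_DIM), dtype=np.float32),
    "actions": np.zeros((BATCH_SIZE, ACTION_DIM), dtype=np.float32),
    "rewards": np.zeros((BATCH_SIZE,), dtype=np.float32),
    "dones": np.zeros((BATCH_SIZE,), dtype=bool),
    # Twice the action dim: means + log-stds for SquashedGaussian.
    "action_dist_inputs": np.zeros((BATCH_SIZE, 2 * ACTION_DIM),
                                   dtype=np.float32),
}
```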

@arturn, @sven1977 thank you so much for the help!

I have one last question, maybe a bit off topic but still related to the mixing of simulated and offline data with SAC.

In my understanding, SAC sets the target entropy to the action space dimension by default (in my case this is 4), and the alpha loss should then converge to this value.
This holds when I train my model without mixing offline and simulated data, but with mixing the alpha loss converges to zero instead. Does this make sense, or am I not setting the training configuration correctly?

[plot: alpha_loss with mixing]


@Manna ,

The target entropy is the product of the action space dimensions by default (if you set it to “auto”).
I’m not 100% sure about the explanation I’m going to give, but I’ll give it nonetheless - just don’t take my word for it. :slight_smile:
No mixing of input data: As your policy converges, the replay buffer fills with data that is roughly on-policy. After convergence, your policy is expected to have very low entropy on such data. That is why the alpha loss ends up close to 4.
Mixing of input data: A portion of the data you train/evaluate on may always be very off-policy. The entropy is therefore not expected to be close to zero, and consequently neither is the alpha loss.
What confuses me in the context of this explanation, though, is that it converges to zero even though your mixing ratio is 9/1.

Have you made any progress on this?

Hi @arturn,

Thanks for the explanation, it makes sense at a high level. I am also confused as to why the alpha_loss converges to very low values when I am mixing in data from a completely different policy.

I wouldn’t say that I made progress, but when I let the model train for a larger number of steps, something strange happens: just as the reward seems to reach its maximum value, the performance (reward) drops suddenly and the alpha_loss jumps back to non-zero values. It is almost as if the policy is forgetting what it has learned up to that point.
I can keep training the model, but it looks like it falls into a state where it fails very early in the episode and does not try to move away from it.

[plots: alpha_loss, episode_reward]

Hi @Manna ,

Maybe just an instance of “catastrophic forgetting”? The rewards look pretty unstable. What about the mean and max rewards? Have you tuned any parameters? Have a look at a) the default SAC config and b) the tuned example that is closest to your case (especially regarding discrete vs. continuous action spaces!). How do the other SAC losses behave at the same time? You could also have a look at tau or the target network update frequency.
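For orientation, the knobs mentioned above map to config keys like these (key names as in RLlib's SAC default config in Ray 1.x; the values here are purely illustrative starting points, not tuned recommendations):

```python
# Hypothetical SAC tuning sketch - values are illustrative, not tuned.
sac_tuning = {
    "tau": 5e-3,                      # soft target-network update coefficient
    "target_network_update_freq": 0,  # 0 = soft-update targets every step
    "optimization": {
        "actor_learning_rate": 3e-4,
        "critic_learning_rate": 3e-4,
        "entropy_learning_rate": 3e-4,  # controls how fast alpha adapts
    },
}
```

Lowering `tau` (or updating the target networks less often) makes the targets move more slowly, which can help with the kind of sudden collapse shown in the plots.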


1 Like