Offline training using previous obs+action=reward tuples


Is it possible to store (obs, action, reward) tuples and then use them to train models? The main use case is changing hyperparameters: instead of rerunning expensive models/environments, we could reuse previously collected data to speed up training to a degree.

Thanks in advance,
Denys A.

Hey @Denys_Ashikhin, yes, this is usually done by our off-policy algorithms, like DQN, SAC, DDPG, CQL, and TD3.
If you look at their execution plans (e.g. in ray/rllib/agents/dqn/), you will see that we create a LocalReplayBuffer there. It stores experience tuples from the environment rollouts, and the samples in it are reused repeatedly for the training updates.
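
To make the idea concrete, here is a rough, framework-independent sketch of what a replay buffer does. It is not RLlib's LocalReplayBuffer; the class name SimpleReplayBuffer and its methods are made up for illustration, and it only stores the (obs, action, reward) tuples from the question. Real off-policy updates usually also need the next observation and a done flag.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Toy fixed-capacity buffer of (obs, action, reward) tuples.

    Hypothetical illustration only, not RLlib's implementation.
    """

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest entry once it is full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, obs, action, reward):
        # Store one experience tuple from a rollout.
        self._storage.append((obs, action, reward))

    def sample(self, batch_size):
        # Uniform sampling; the same tuple can be drawn again in later
        # calls, which is how stored data gets reused across updates.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


# Fill the buffer past its capacity, then draw a training batch.
buffer = SimpleReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(obs=step, action=step % 2, reward=float(step))
batch = buffer.sample(2)
```

With capacity=3, only the three most recent tuples remain after five adds, and each call to sample() draws a batch from those. Changing hyperparameters would then only change how the sampled batches are used, not how the data was collected.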