Accessing the memory buffer (DQN)

Hi,
When using a DQN agent (or any other relevant algorithm, for that matter), is there a way I can manipulate the agent's memory buffer during training?

*By manipulation I mean adding transitions to the buffer or removing transitions from it.

Hi @Ofir_Abu ,

This manipulation is certainly not trivial. As you can see in the source code, the ReplayBuffer holds in its _storage a list of SampleBatches that contain the experiences from the environment. This means you need to either add SampleBatches (by using add()) or remove them directly from this buffer.
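
To make this concrete, here is a minimal sketch of what adding and removing could look like. It assumes the RLlib generation discussed in this thread (buffer classes under ray.rllib.execution.replay_buffer; newer releases moved them to ray.rllib.utils.replay_buffers and slightly changed the add() signature), and it uses a standalone buffer plus dummy transition values purely for illustration; in practice `buffer` would be the trainer's local replay buffer:

```python
import numpy as np
from ray.rllib.execution.replay_buffer import ReplayBuffer  # newer RLlib: ray.rllib.utils.replay_buffers
from ray.rllib.policy.sample_batch import SampleBatch

# Standalone buffer just so the example runs; in practice this would be the
# trainer's local replay buffer.
buffer = ReplayBuffer(capacity=1000)

# Adding: wrap a hand-crafted transition in a one-timestep SampleBatch and
# push it through add(). (Dummy values; depending on the version add() also
# takes a priority weight, which a plain ReplayBuffer ignores.)
fake_transition = SampleBatch({
    SampleBatch.OBS: np.array([[0.0, 0.0, 0.0, 0.0]]),
    SampleBatch.ACTIONS: np.array([0]),
    SampleBatch.REWARDS: np.array([1.0]),
    SampleBatch.NEXT_OBS: np.array([[0.1, 0.0, 0.0, 0.0]]),
    SampleBatch.DONES: np.array([False]),
})
buffer.add(fake_transition, weight=1.0)

# Removing: there is no public API for this, so it means editing the private
# _storage list directly, e.g. dropping all batches with non-positive reward.
buffer._storage = [
    b for b in buffer._storage if b[SampleBatch.REWARDS].sum() > 0
]
```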

Hi, I am interested in learning how to customize policies/models by reading DQN’s code (because the official RLlib documentation is really hard to follow). However, I feel pretty confused when reading it.

Do you have any suggestions on where I should start to read?
Should I have a strong TensorFlow or PyTorch background?

@Roller, could you start a new topic?

Yes. Sorry. I have started a new topic. Here is the link, if you are interested in it.

Hi, sorry for reopening the issue, I need some technical help.
Given a custom model inheriting from RecurrentTFModelV2 and a trainer obtained from trainer_template.build_trainer with A3CTFPolicy: how can I access the replay_buffer that the actual training is based on?

Thanks in advance!

@Ofir_Abu ,

You started with DQN, and DQN works with a replay buffer to estimate the Q* function. In DQN's execution_plan() you therefore find a local_replay_buffer that collects the data and replays it for network training. DQN is an off-policy algorithm.
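
To show where that buffer sits in the data flow, here is a heavily simplified sketch in the style of the execution-op API of that RLlib generation (the real DQN plan additionally updates priorities, syncs the target network, and reports metrics; check the function and argument names against your installed version):

```python
from ray.rllib.execution.rollout_ops import ParallelRollouts
from ray.rllib.execution.replay_ops import Replay, StoreToReplayBuffer
from ray.rllib.execution.train_ops import TrainOneStep
from ray.rllib.execution.concurrency_ops import Concurrently


def simplified_dqn_plan(workers, local_replay_buffer):
    # 1) Collect experiences from the rollout workers and store every batch
    #    in the local replay buffer.
    rollouts = ParallelRollouts(workers, mode="bulk_sync")
    store_op = rollouts.for_each(
        StoreToReplayBuffer(local_buffer=local_replay_buffer))

    # 2) Independently, sample batches back out of the buffer and run one
    #    training step on them.
    replay_op = Replay(local_buffer=local_replay_buffer) \
        .for_each(TrainOneStep(workers))

    # 3) Interleave storing and replaying; only the replay branch emits
    #    training results.
    return Concurrently([store_op, replay_op], mode="round_robin",
                        output_indexes=[1])
```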

In contrast, A3C is an on-policy algorithm that collects data in the environment and trains on it directly. It also estimates something different from DQN: instead of Q* it goes for Q^\pi. Mathematically, the difference lies in the Bellman equations used: DQN uses the Bellman optimality equation, whereas A3C uses the Bellman expectation equation. The latter needs the expectation evaluated with respect to the current policy; this in turn would bias the estimates if old samples (collected with old policies) were used. Therefore A3C does not use replay.
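
In symbols (standard RL notation, not taken from the RLlib code):

$$
Q^*(s,a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q^*(s', a') \,\big|\, s, a \big]
\qquad \text{vs.} \qquad
Q^\pi(s,a) = \mathbb{E}\big[\, r + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} Q^\pi(s', a') \,\big|\, s, a \big]
$$

The inner expectation over a' ~ \pi is what ties the target to the current policy and rules out replaying stale samples.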

Hope this clarifies it a little.

As a heads-up: the local replay buffer will probably be named MultiAgentReplayBuffer in the future.

In reply to this: each MultiAgentReplayBuffer uses a list of PrioritizedReplayBuffers (one for each policy). I guess this means that if standard replay (no prioritized experience replay) should be used, we have to set the prioritized_replay_alpha attribute in the config to 0.0? The default is 0.6 in the PrioritizedReplayBuffer constructor.

Exactly, no prioritization is achieved with prioritized_replay_alpha=0.


I have to add:

All replay buffers appear to use the PrioritizedReplayBuffer under the hood, and it requires an alpha > 0, which makes non-prioritized replay impossible this way.
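
For completeness, a hedged sketch of where this knob sits in a DQN config of that RLlib generation (top-level prioritized_replay_alpha key, DQNTrainer under ray.rllib.agents.dqn; newer releases nest these settings under replay_buffer_config). As noted above, an exact 0.0 may be rejected by the buffer's alpha > 0 requirement, in which case a very small positive alpha gives practically uniform sampling:

```python
import ray
from ray.rllib.agents.dqn import DQNTrainer

config = {
    "env": "CartPole-v1",
    "framework": "torch",
    # alpha -> 0 flattens all priorities (p^alpha -> 1), i.e. uniform replay.
    # If the buffer's constructor rejects exactly 0.0, a tiny positive value
    # such as 1e-6 is the practical fallback.
    "prioritized_replay_alpha": 1e-6,
}

ray.init()
trainer = DQNTrainer(config=config)
print(trainer.train()["episode_reward_mean"])
```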