Read Tune console output from Simple Q

Hi @Lars_Simon_Zehnder,

You were right when you initially tried to relate the number of episode steps used during the rollout and training phases. You correctly realized that the two phases alternate back and forth, but I think the misunderstanding may be about how often those steps occur. When I first looked at it, my assumption was also that it would sample "timesteps_per_iteration" steps during rollout and then do one update, but that is not how it works.

Assuming we are using batch_mode="truncate_episodes":
During the rollout phase it will collect max(num_workers, 1) * rollout_fragment_length new samples. That is checked here:
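Since the linked snippet may not render here, this is a rough sketch of what one rollout round does (not the actual RLlib source; `rollout_phase` and `fake_fragment` are names I made up for illustration):

```python
def fake_fragment(n):
    """Stand-in for a worker's sample() call returning n env steps."""
    return [0] * n

def rollout_phase(num_workers: int, rollout_fragment_length: int):
    """Sketch of one rollout round: every round gathers one fragment from
    each sampling worker, so max(num_workers, 1) * rollout_fragment_length
    new env steps arrive before the next training step runs."""
    samples = []
    for _ in range(max(num_workers, 1)):  # local worker samples if num_workers == 0
        samples.extend(fake_fragment(rollout_fragment_length))
    return samples

print(len(rollout_phase(num_workers=0, rollout_fragment_length=4)))  # 4
```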

Then it will do the training step. It will also collect learning_starts (1000) steps before doing any training at all.

So you have the following:
In the first iteration you collect 4 samples 250 times (1000 steps) and train 1 time with 32 steps.
On each iteration after that you will collect 1000 new samples in the rollout phase (in chunks of 4, 250 times) and you will train on 32 steps 250 times.

steps_sampled = 439 * 1000 = 439_000
steps_trained = 32 + 438 * 32*250 = 3_504_032
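Those totals can be double-checked with a few lines of arithmetic (the variable names here are mine, not RLlib's):

```python
iterations = 439
timesteps_per_iteration = 1000
rollout_chunk = 4  # max(num_workers, 1) * rollout_fragment_length
rounds_per_iteration = timesteps_per_iteration // rollout_chunk  # 250
train_batch_size = 32

steps_sampled = iterations * timesteps_per_iteration
# The first iteration trains only once (learning_starts = 1000 is reached on
# its very last rollout round); every later iteration trains once per round.
steps_trained = train_batch_size + (iterations - 1) * train_batch_size * rounds_per_iteration

print(steps_sampled)  # 439000
print(steps_trained)  # 3504032
```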

OK, so target_updates, which should happen every 500 ts, is a little tricky. You would expect the following:
target_updates = 1 + 438 * 2 = 877, but we only get 870. Why?

Well, look at the code:

Line 392 uses > not >=, so you are shifting the target update point by max(num_workers, 1) * rollout_fragment_length every time.

So the update count ends up being:

target_updates = 1 + 438 * 2 - (438 * 2 * (1 * 4) / 500) = 870
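You can reproduce both numbers with a small simulation of the update check (a sketch under my own assumptions about the counters, not the actual RLlib source):

```python
def count_target_updates(strict: bool) -> int:
    """Simulate target updates over the whole run.

    strict=True uses '>' (what line 392 actually does), so each update
    lands max(num_workers, 1) * rollout_fragment_length = 4 ts late and
    the effective update period drifts from 500 to 504.
    """
    target_network_update_freq = 500
    learning_starts = 1000
    step = 4            # max(num_workers, 1) * rollout_fragment_length
    total_ts = 439_000  # env steps sampled over the 439 iterations

    last_update_ts, updates = 0, 0
    for ts in range(step, total_ts + 1, step):
        if ts < learning_starts:
            continue  # no training (hence no target updates) yet
        elapsed = ts - last_update_ts
        if elapsed > target_network_update_freq if strict else elapsed >= target_network_update_freq:
            updates += 1
            last_update_ts = ts
    return updates

print(count_target_updates(strict=True))   # 870 -- the observed count ('>')
print(count_target_updates(strict=False))  # 877 -- the expected count ('>=')
```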

Here is the code that skips training for learning_starts steps:
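In case the snippet does not render, the gate amounts to something like this (a rough sketch, not the exact RLlib source; `should_train` is a name I made up):

```python
learning_starts = 1000  # SimpleQ-style default

def should_train(num_steps_sampled: int) -> bool:
    # The training step is effectively a no-op until learning_starts
    # env steps have been collected into the replay buffer.
    return num_steps_sampled >= learning_starts

print(should_train(996))   # False -- still warming up
print(should_train(1000))  # True  -- the single train step of iteration 1
```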

One last thing about replay_sequence_length since you mentioned it a couple times.

The replay buffer stores buffer_size samples. The question is: samples of what size? It stores buffer_size samples of length replay_sequence_length.

This happens here:

If you sample from the buffer then you will get train_batch_size samples. These can, and usually will, be non-contiguous in time, but each of these samples will be replay_sequence_length long, and that subsequence will be temporally contiguous. Usually, though not strictly required, this is used for memory or attention, and replay_sequence_length will be the same as max_sequence_length.
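A toy buffer makes the "buffer_size sequences of length replay_sequence_length" idea concrete (my own minimal sketch, nothing like RLlib's actual implementation):

```python
import random
from collections import deque

class SequenceReplayBuffer:
    """Toy sketch: stores fixed-length, temporally contiguous sub-sequences."""

    def __init__(self, buffer_size: int, replay_sequence_length: int):
        self.storage = deque(maxlen=buffer_size)  # buffer_size *sequences*
        self.seq_len = replay_sequence_length
        self._partial = []

    def add(self, transition):
        self._partial.append(transition)
        if len(self._partial) == self.seq_len:
            # One stored "sample" is a whole contiguous sequence.
            self.storage.append(tuple(self._partial))
            self._partial = []

    def sample(self, train_batch_size: int):
        # Sequences are drawn independently, so they are usually
        # non-contiguous with each other, but each one is contiguous inside.
        return random.sample(list(self.storage), train_batch_size)

buf = SequenceReplayBuffer(buffer_size=100, replay_sequence_length=5)
for t in range(50):          # 50 env steps -> 10 stored sequences
    buf.add(t)
batch = buf.sample(train_batch_size=4)
print([len(seq) for seq in batch])  # [5, 5, 5, 5]
```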

Hope this helps everything make sense.
