Read Tune console output from Simple Q

Hi everyone,

I run SimpleQTrainer with Tune using the following script:

import ray
from ray import tune
import ray.rllib.agents.dqn as dqn

config = dqn.SIMPLE_Q_DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 0

tune_config = {
    "env": "Breakout-v0",
    "model": config
}

tune.run(dqn.SimpleQTrainer, config=tune_config)

Everything seems to run nicely, but I am trying to understand the console output from Tune and RLlib.

1. First of all: I get the exact same output from Tune several times in a row:

I can’t understand where this is coming from. From different environments of the VectorEnv? The number of such identical outputs varies around 5. What is the explanation for this behavior?

2. The output of the metrics shows me the following:

[Screenshot from 2021-10-18 17-38-32]

num_steps_sampled is clear: I am at iteration 439, and in each iteration timesteps_per_iteration=1000 steps get sampled. However, num_steps_trained is unclear to me.
As far as I understood from the source code, the sampling and training operations alternate ("round_robin" mode), and a call to learn_on_batch() only happens after 1000 steps have been sampled. Why is num_steps_trained then so high, especially when train_batch_size=32 * replay_sequence_length=1 is significantly smaller than timesteps_per_iteration=1000?

3. How should I interpret num_target_updates in the last image? From the DEFAULT_CONFIG I read that the target network is updated every 500 timesteps. This almost fits the output:

(num_steps_sampled - learning_starts - target_update_freq) 
      / target_update_freq = 875

so almost the reported num_target_updates=870. Where does the difference come from?
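Spelling that expectation out as a quick check (using the SimpleQ defaults quoted above, learning_starts=1000 and target_network_update_freq=500):

```python
# Expected number of target updates from the config defaults.
num_steps_sampled = 439 * 1000          # iteration 439, 1000 steps each
learning_starts = 1000
target_network_update_freq = 500

expected = (num_steps_sampled - learning_starts
            - target_network_update_freq) // target_network_update_freq
print(expected)  # 875, while the console shows ~870
```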

Thanks for taking the time!

Hi @Lars_Simon_Zehnder,

Regarding 2. and 3., I have an experiment for you. Change the rollout_fragment_length to be equal to the train_batch_size and see what happens.

Hi @mannyv ,

thanks for the reply. I tried that out and it helped a lot to understand. Thanks!! I saw at first that the worker collects 1024 = 32*32 steps. Makes sense to me: the rollout_fragment_length gets collected train_batch_size times.

Regarding 2.: The Trainer then starts training after learning_starts=1000 steps have been sampled, which happens in the first iteration. It trains on a train_batch_size with replay_sequence_length=1 (I guess this parameter simply defines how many contiguous steps should be resampled; in this case all samples can be non-contiguous).

Now, timesteps_per_iteration=1000, so in each iteration at least 1000 steps should be trained with train_batch_size=32. The smallest multiple of train_batch_size greater than timesteps_per_iteration equals -(-(1000)//32)*32 = 1024, so each training iteration trains on 1024 steps (32 batches). As training is only allowed after learning_starts steps, the Trainer gets to train on only a single batch in the first iteration. Where do I find this in the source code?
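As a quick sanity check of that arithmetic (plain Python, using the ceiling-division trick from above):

```python
timesteps_per_iteration = 1000
train_batch_size = 32

# Smallest multiple of train_batch_size that is >= timesteps_per_iteration,
# via Python's upside-down floor-division trick for ceiling division.
steps_per_train_iteration = -(-timesteps_per_iteration // train_batch_size) * train_batch_size
print(steps_per_train_iteration)  # 1024, i.e. 32 batches of 32 steps
```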

Regarding 3.: target_network_update_freq=500, which means every time num_train_steps % 500 == 0 holds, the target network also gets updated. This happens once in the first iteration and twice per training iteration thereafter. Here I would also like to take a look at the source code …

I am still thinking about the multiple identical console outputs from Tune under 1..

Thanks for your time, @mannyv !

Hi Lars,

regarding 1), Tune outputs a status table every 5 seconds. If no new result has been received, the table stays the same.

In the latest master we introduced a timer; with it you can see that time progresses:

== Status ==
Current time: 2021-10-20 09:14:28 (running for 00:00:52.47)
Memory usage on this node: 11.2/16.0 GiB

Hi @Lars_Simon_Zehnder,

You were right when you initially tried to relate the number of episode steps used during the rollout and training phases. You correctly realized that they alternate back and forth, but I think the misunderstanding is about how often those steps occur. When I first looked at it, my assumption was also that it would sample timesteps_per_iteration steps during rollout and then do one update, but that is not how it works.

Assuming we are using batch_mode="truncate_episodes": during the rollout phase it will collect max(num_workers, 1) * rollout_fragment_length new samples. That is checked here:

Then it will do the training step. It also waits until 1000 steps (learning_starts) have been collected before training at all.

So you have the following:
In the first iteration you collect 4 samples 250 times and train 1 time on 32 steps.
On each iteration after that, you collect 1000 new samples in the rollout phase (250 chunks of 4) and you train on 32 steps 250 times.

steps_sampled = 439 * 1000 = 439_000
steps_trained = 32 + 438 * 32*250 = 3_504_032
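Those counters can be reproduced with a quick back-of-the-envelope script (assuming, as above, one worker, rollout_fragment_length=4, and 250 train calls per iteration):

```python
# Reproducing the reported counters under the stated assumptions.
iterations = 439
steps_sampled = iterations * 1000

# First iteration trains once (right after learning_starts is reached);
# every later iteration trains 250 times on 32-step batches.
steps_trained = 32 + (iterations - 1) * 250 * 32

print(steps_sampled, steps_trained)  # 439000 3504032
```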

OK, so target_updates, which should happen every 500 trained steps, is a little tricky. You would expect the following:

target_updates = 1 + 438 * 2 = 877, but we only get 870. Why?

Well, look at the code:

Line 392 uses > not >=, so you are shifting the target by max(num_workers, 1) * rollout_fragment_length every time.

So the update time ends up being:

target_updates = 1 + 438 * 2 - (438 * 2 * (1 * 4) / 500) = 870
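Here is that drift calculation spelled out (the 4 is max(num_workers, 1) * rollout_fragment_length from the config above):

```python
iterations = 439
naive_updates = 1 + (iterations - 1) * 2   # one update per 500 trained steps

# Because the check uses `>` instead of `>=`, each update point drifts
# by max(num_workers, 1) * rollout_fragment_length = 4 sampled steps.
drift = (iterations - 1) * 2 * 4 / 500

print(naive_updates, round(naive_updates - drift))  # 877 870
```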

Here is the code that skips training for learning_starts steps:

One last thing about replay_sequence_length since you mentioned it a couple times.

The replay buffer stores buffer_size samples. The question is: samples of what size? Well, it stores buffer_size samples of length replay_sequence_length.

This happens here:

If you sample from the buffer, you will get train_batch_size samples. These can, and usually will, be non-contiguous in time, but each of these samples will be replay_sequence_length long, and that sub-sequence will be temporally contiguous. Usually (though not strictly required) this is used for memory or attention, and replay_sequence_length will be the same as max_sequence_length.
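To make the storage/sampling distinction concrete, here is a toy sketch (not RLlib's actual buffer class; names and structure are purely illustrative):

```python
import random

# Toy replay buffer: stores fixed-length sequences of
# replay_sequence_length contiguous steps; sampling returns
# train_batch_size such sequences, non-contiguous across sequences.
replay_sequence_length = 1
buffer_size = 100
buffer = []  # each entry: a list of replay_sequence_length transitions

def store(transitions):
    for i in range(0, len(transitions), replay_sequence_length):
        seq = transitions[i:i + replay_sequence_length]
        if len(seq) == replay_sequence_length:
            buffer.append(seq)
            if len(buffer) > buffer_size:
                buffer.pop(0)  # drop oldest sequence

def sample(train_batch_size):
    # Sequences are drawn independently, so the batch is usually
    # non-contiguous in time, but each sequence is contiguous internally.
    return [random.choice(buffer) for _ in range(train_batch_size)]

store(list(range(50)))            # 50 dummy transitions
batch = sample(32)
print(len(batch), len(batch[0]))  # 32 1
```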

Hope this helps everything make sense.


@kai thanks a lot for clarifying this! I like the new approach in the upcoming release. It made me a little unsure whether this was the right behavior or whether I had to make some modifications to the code …

Hi @mannyv ,

that is an amazingly detailed answer! Thanks for the clarification. I did indeed misunderstand how samples are collected in rollouts. So, as far as I understand it, timesteps_per_iteration gives a lower bound for the steps to be sampled, and if that number cannot be divided evenly by max(num_workers, 1) * rollout_fragment_length it can also overshoot.

What I do not yet fully understand (only intuitively; I will read deeper into the code) is where the Trainer gets the 250 train steps from. Is this somehow calculated via timesteps_per_iteration and num_workers?

Regarding the num_target_updates count: this is tricky indeed! I don’t know if I would have come up with this explanation without the hint.

In regard to replay_sequence_length, I assume that a sequence of length 1 is actually one step in the environment with (s_t, a_t, r_{t+1}, s_{t+1}), as in the original paper by Mnih et al. (2013). Correct me if I am wrong.

Great links to the source code. This helps me a lot! Thanks for your time!

Hi @Lars_Simon_Zehnder,

Yes, timesteps_per_iteration expresses a lower bound on the number of new steps sampled. You can collect more timesteps if they do not divide evenly. Keep in mind, though, that the frequency of an iteration is really more a decision about how often you want new metrics reported and checkpoints written. There are usually several sampling and training cycles within one iteration.

The 250 is not computed explicitly. Tune calls the trainer's train function repeatedly until the iteration criteria (timesteps_per_iteration and min_iteration_ms) are satisfied. It just so happens that pulling 4 samples at a time takes 250 calls. If the rollout_fragment_length were 6, I would expect to see 167 cycles per iteration.
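That cycle count is just a ceiling division; a small helper (hypothetical, not an RLlib function) makes the relationship explicit:

```python
import math

# How many sample/train cycles fit in one iteration, given that each
# cycle collects max(num_workers, 1) * rollout_fragment_length steps.
def cycles_per_iteration(timesteps_per_iteration, num_workers, rollout_fragment_length):
    chunk = max(num_workers, 1) * rollout_fragment_length
    return math.ceil(timesteps_per_iteration / chunk)

print(cycles_per_iteration(1000, 0, 4))  # 250
print(cycles_per_iteration(1000, 0, 6))  # 167
```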

Yes, you are correct: the replay buffer will hold all the values collected for each environment transition, the ones you listed plus a few more, I think. It depends on which ViewRequirements are set.

One last thing to note: the description in the previous post is about the off-policy algorithms that use a replay buffer. The on-policy ones like PPO and A2C are largely the same but have some slight differences. The main one is that samples are collected in num_workers * rollout_fragment_length chunks until train_batch_size is reached, and then the training op is run.
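The on-policy accumulation pattern can be sketched like this (illustrative numbers, not RLlib code):

```python
# On-policy pattern: collect num_workers * rollout_fragment_length steps
# per cycle until train_batch_size is reached, then run one training op.
num_workers = 2
rollout_fragment_length = 100
train_batch_size = 400

collected = 0
cycles = 0
while collected < train_batch_size:
    collected += num_workers * rollout_fragment_length
    cycles += 1

print(cycles, collected)  # 2 400
```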


This is a perfect clarification! Thank you @mannyv ! And respect for reading so much through the source code :100: