Read Tune console output from Simple Q

Hi everyone,

I run SimpleQTrainer with Tune using the following script:

import ray
from ray import tune
import ray.rllib.agents.dqn as dqn

config = dqn.SIMPLE_Q_DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 0

tune_config = {
    "env": "Breakout-v0",
    "model": config
}

tune.run(dqn.SimpleQTrainer, config=tune_config)

Everything seems to run nicely, but I am trying to understand the console output from Tune and RLlib.

1. First of all: I get the exact same output from Tune several times in a row:

I can’t understand where this is coming from. From different environments of the VectorEnv? The number of such identical outputs varies around 5. What is the explanation for this behavior?

2. The output of the metrics shows me the following:

[Screenshot from 2021-10-18 17-38-32]

num_steps_sampled is clear: I am at iteration 439, and in each iteration timesteps_per_iteration=1000 steps get sampled. However, num_steps_trained is unclear to me.
As far as I understood from the source code, the sampling and training operations alternate ("round_robin" mode), and a call to learn_on_batch() only happens after 1000 steps have been sampled. Why is num_steps_trained then so high, especially when train_batch_size=32 * replay_sequence_length=1 is significantly smaller than timesteps_per_iteration=1000?

3. How should I interpret num_target_updates in the last image? From the DEFAULT_CONFIG I read that the target network is updated every 500 timesteps. This almost fits the output:

(num_steps_sampled - learning_starts - target_update_freq) 
      / target_update_freq = 875

so almost the reported num_target_updates=870. Where does the difference come from?
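Spelling that expectation out as a quick check (using the SimpleQ defaults quoted above, learning_starts=1000 and target_network_update_freq=500):

```python
# Expected number of target updates from the config defaults.
num_steps_sampled = 439 * 1000          # iteration 439, 1000 steps each
learning_starts = 1000
target_network_update_freq = 500

expected = (num_steps_sampled - learning_starts
            - target_network_update_freq) // target_network_update_freq
print(expected)  # 875, while the console shows ~870
```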

Thanks for taking the time!

Hi @Lars_Simon_Zehnder,

Regarding 2. and 3., I have an experiment for you. Change the rollout_fragment_length to be equal to the train_batch_size and see what happens.

Hi @mannyv ,

thanks for the reply. I tried that out and it helped a lot to understand. Thanks!! I saw at first that the worker collects 1024 = 32*32 steps. Makes sense to me: the rollout_fragment_length gets collected train_batch_size times.

Regarding 2.: The Trainer then starts training after learning_starts=1000 steps have been sampled, which happens in the first iteration. It trains on a train_batch_size with replay_sequence_length=1 (I guess this parameter simply defines how many contiguous steps should be resampled; in this case all samples can be non-contiguous).

Now, timesteps_per_iteration=1000, so in each iteration at least 1000 steps should be trained with train_batch_size=32. The smallest multiple of train_batch_size greater than timesteps_per_iteration equals -(-(1000)//32)*32 = 1024, so each training iteration trains on 1024 steps (32 batches). As training is only allowed after learning_starts steps, the Trainer gets to train on only a single batch in the first iteration. Where do I find this in the source code?
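As a quick sanity check of that arithmetic (plain Python, using the ceiling-division trick from above):

```python
timesteps_per_iteration = 1000
train_batch_size = 32

# Smallest multiple of train_batch_size that is >= timesteps_per_iteration,
# via Python's upside-down floor-division trick for ceiling division.
steps_per_train_iteration = -(-timesteps_per_iteration // train_batch_size) * train_batch_size
print(steps_per_train_iteration)  # 1024, i.e. 32 batches of 32 steps
```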

Regarding 3.: target_network_update_freq=500, which means every time num_train_steps % 500 == 0 holds, the target network also gets updated. This happens once in the first iteration and twice per training iteration thereafter. Here I would also like to take a look at the source code …

I am still thinking about the multiple identical console outputs from Tune under 1..

Thanks for your time, @mannyv !

Hi Lars,

regarding 1), Tune outputs a status table every 5 seconds. If no new result has been received, the table stays the same.

In the latest master we introduced a timer; with it you can see that time progresses:

== Status ==
Current time: 2021-10-20 09:14:28 (running for 00:00:52.47)
Memory usage on this node: 11.2/16.0 GiB

Hi @Lars_Simon_Zehnder,

You were right when you initially tried to relate the number of episode steps used during the rollout and training phases. You correctly realized that they alternate back and forth, but I think the misunderstanding is about how often those steps occur. When I first looked at it, my assumption was also that it would sample timesteps_per_iteration steps during rollout and then do one update, but that is not how it works.

Assuming we are using batch_mode="truncate_episodes": during the rollout phase it will collect max(num_workers, 1) * rollout_fragment_length new samples. That is checked here:

Then it will do the training step. It also waits until 1000 steps (learning_starts) have been collected before training at all.

So you have the following:
In the first iteration you collect 4 samples 250 times and train 1 time on 32 steps.
On each iteration after that, you collect 1000 new samples in the rollout phase (250 chunks of 4) and you train on 32 steps 250 times.

steps_sampled = 439 * 1000 = 439_000
steps_trained = 32 + 438 * 32*250 = 3_504_032
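Those counters can be reproduced with a quick back-of-the-envelope script (assuming, as above, one worker, rollout_fragment_length=4, and 250 train calls per iteration):

```python
# Reproducing the reported counters under the stated assumptions.
iterations = 439
steps_sampled = iterations * 1000

# First iteration trains once (right after learning_starts is reached);
# every later iteration trains 250 times on 32-step batches.
steps_trained = 32 + (iterations - 1) * 250 * 32

print(steps_sampled, steps_trained)  # 439000 3504032
```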

OK, so target_updates, which should happen every 500 trained steps, is a little tricky. You would expect the following:

target_updates = 1 + 438 * 2 = 877, but we only get 870. Why?

Well, look at the code:

Line 392 uses > not >=, so you are shifting the target by max(num_workers, 1) * rollout_fragment_length every time.

So the update time ends up being:

target_updates = 1 + 438 * 2 - (438 * 2 * (1 * 4) / 500) = 870
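Here is that drift calculation spelled out (the 4 is max(num_workers, 1) * rollout_fragment_length from the config above):

```python
iterations = 439
naive_updates = 1 + (iterations - 1) * 2   # one update per 500 trained steps

# Because the check uses `>` instead of `>=`, each update point drifts
# by max(num_workers, 1) * rollout_fragment_length = 4 sampled steps.
drift = (iterations - 1) * 2 * 4 / 500

print(naive_updates, round(naive_updates - drift))  # 877 870
```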

Here is the code that skips training for learning_starts steps:

One last thing about replay_sequence_length since you mentioned it a couple times.

The replay buffer stores buffer_size samples. The question is: samples of what size? Well, it stores buffer_size samples of length replay_sequence_length.

This happens here:

If you sample from the buffer, you will get train_batch_size samples. These can, and usually will, be non-contiguous in time, but each of these samples will be replay_sequence_length long, and that sub-sequence will be temporally contiguous. Usually (though not strictly required) this is used for memory or attention, and replay_sequence_length will be the same as max_sequence_length.
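To make the storage/sampling distinction concrete, here is a toy sketch (not RLlib's actual buffer class; names and structure are purely illustrative):

```python
import random

# Toy replay buffer: stores fixed-length sequences of
# replay_sequence_length contiguous steps; sampling returns
# train_batch_size such sequences, non-contiguous across sequences.
replay_sequence_length = 1
buffer_size = 100
buffer = []  # each entry: a list of replay_sequence_length transitions

def store(transitions):
    for i in range(0, len(transitions), replay_sequence_length):
        seq = transitions[i:i + replay_sequence_length]
        if len(seq) == replay_sequence_length:
            buffer.append(seq)
            if len(buffer) > buffer_size:
                buffer.pop(0)  # drop oldest sequence

def sample(train_batch_size):
    # Sequences are drawn independently, so the batch is usually
    # non-contiguous in time, but each sequence is contiguous internally.
    return [random.choice(buffer) for _ in range(train_batch_size)]

store(list(range(50)))            # 50 dummy transitions
batch = sample(32)
print(len(batch), len(batch[0]))  # 32 1
```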

Hope this helps everything make sense.


@kai thanks a lot for clarifying this! I like the new approach in the upcoming release. It made me a little unsure whether this was the right behavior or whether I had to make some modifications to the code …

Hi @mannyv ,

that is an amazingly detailed answer! Thanks for the clarification. I did indeed misunderstand how samples are collected in rollouts. So, as far as I understand it, timesteps_per_iteration gives a lower bound for the steps to be sampled, and if that number cannot be divided evenly by max(num_workers, 1) * rollout_fragment_length it can also overshoot.

What I do not yet fully understand (only intuitively; I will read deeper into the code) is where the Trainer gets the 250 train steps from. Is this somehow calculated via timesteps_per_iteration and num_workers?

Regarding the num_target_updates count: this is tricky indeed! I don’t know if I would have come up with this explanation without the hint.

In regard to replay_sequence_length, I assume that a sequence of length 1 is actually one step in the environment with (s_t, a_t, r_{t+1}, s_{t+1}), as in the original paper by Mnih et al. (2013). Correct me if I am wrong.

Great links to the source code. This helps me a lot! Thanks for your time!

Hi @Lars_Simon_Zehnder,

Yes, timesteps_per_iteration expresses a lower bound on the number of new steps sampled. You can collect more timesteps if they do not divide evenly. Keep in mind, though, that the frequency of an iteration is really more a decision about how often you want new metrics reported and checkpoints written. There are usually several sampling and training cycles within one iteration.

The 250 is not computed explicitly. Tune calls the trainer's train function repeatedly until the iteration criteria (timesteps_per_iteration and min_iteration_ms) are satisfied. It just so happens that pulling 4 samples at a time takes 250 calls. If the rollout_fragment_length were 6, I would expect to see 167 cycles per iteration.
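That cycle count is just a ceiling division; a small helper (hypothetical, not an RLlib function) makes the relationship explicit:

```python
import math

# How many sample/train cycles fit in one iteration, given that each
# cycle collects max(num_workers, 1) * rollout_fragment_length steps.
def cycles_per_iteration(timesteps_per_iteration, num_workers, rollout_fragment_length):
    chunk = max(num_workers, 1) * rollout_fragment_length
    return math.ceil(timesteps_per_iteration / chunk)

print(cycles_per_iteration(1000, 0, 4))  # 250
print(cycles_per_iteration(1000, 0, 6))  # 167
```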

Yes, you are correct: the replay buffer will hold all the values collected for each environment transition, the ones you listed plus a few more, I think. It depends on which ViewRequirements are set.

One last thing to note: the description in the previous post is about the off-policy algorithms that use a replay buffer. The on-policy ones like PPO and A2C are largely the same but have some slight differences. The main one is that samples are collected in num_workers * rollout_fragment_length chunks until train_batch_size is reached, and then the training op is run.
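The on-policy accumulation pattern can be sketched like this (illustrative numbers, not RLlib code):

```python
# On-policy pattern: collect num_workers * rollout_fragment_length steps
# per cycle until train_batch_size is reached, then run one training op.
num_workers = 2
rollout_fragment_length = 100
train_batch_size = 400

collected = 0
cycles = 0
while collected < train_batch_size:
    collected += num_workers * rollout_fragment_length
    cycles += 1

print(cycles, collected)  # 2 400
```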


This is a perfect clarification! Thank you @mannyv ! And respect for reading so much through the source code :100: