Understanding SAC: Data Collection and Training

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity




Hi,

I’ve been working with PPO in RLlib, but I’m pretty new to off-policy algorithms, particularly SAC.
I’d like to learn how the SAC algorithm collects data and processes training batches in RLlib’s implementation.

Below, I’ve summarized what I know so far and outlined some questions I have on the topic.


In my understanding,

SAC

  • Let’s assume:

    • Single-agent RL, single-agent environment
    • RLlib’s implementation with the following configuration settings:
      • batch_mode: "truncate_episodes"
      • min_sample_timesteps_per_iteration: 1
      • min_time_s_per_iteration: small enough that only rollout_fragment_length samples are collected in one rollout
  • Data collection

    • A rollout worker collects rollout_fragment_length samples from each of its num_envs_per_worker environments.
    • There are num_workers rollout workers in the data-collection phase of an SAC iteration.
    • In total, the following number of samples is collected every iteration:
      num_total_samples = num_workers * num_envs_per_worker * rollout_fragment_length.
    • All of these samples are put into the replay buffer. (Let’s skip the maintenance of the replay buffer when it is full, as that is out of scope here.)
    • Let’s call this data-collection phase a data_op (i.e., a data-collection operation).
  • Train from batches

    • If at least num_steps_sampled_before_learning_starts samples have been collected, proceed with the following; otherwise, skip the rest of this phase.
    • Make a train batch: randomly sample train_batch_size (environment) timesteps from the buffer.
    • Train the networks using that batch.
    • Let’s call this “make a batch and train the networks” step a train_op.
    • The train_op is repeated a number of times, based on the training_intensity value.
    • Let’s define “natural_ratio” as train_batch_size / num_total_samples (used in the questions below; see also the sketch right after this list).
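
To make the bookkeeping above concrete, here is a minimal sketch in plain Python (not RLlib code; the helper name and the reading of training_intensity as the desired trained-to-sampled timestep ratio are my assumptions):

    def iteration_bookkeeping(num_workers, num_envs_per_worker,
                              rollout_fragment_length,
                              train_batch_size, training_intensity):
        # data_op: environment steps pushed into the replay buffer per iteration
        num_total_samples = (num_workers
                             * num_envs_per_worker
                             * rollout_fragment_length)

        # natural_ratio: trained-to-sampled ratio if exactly one train_op
        # were run per data_op
        natural_ratio = train_batch_size / num_total_samples

        # Assumption: training_intensity is the target trained-to-sampled
        # timestep ratio, so the number of train_ops per data_op would be
        # training_intensity / natural_ratio.
        train_ops_per_data_op = training_intensity / natural_ratio

        return num_total_samples, natural_ratio, train_ops_per_data_op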



Questions

  1. Are the descriptions in Data collection accurate? If not, please help me get them right.

  2. Are the descriptions in Train from batches accurate? If not, I would appreciate any corrections or additional details.

  3. Does training_intensity always determine the ratio of the number of train_ops to the number of data_ops carried out in an SAC iteration (via the natural_ratio defined above)?
    e.g.

    config = {
      "env": "Pendulum-v1",  # episode_length: 200 (fixed)
      #
      "training_intensity": 1000,
      "train_batch_size": 200,
      "rollout_fragment_length": 2,  # note: RLlib 2.1.0; 'auto' not working
      "num_workers": 1,
      "num_envs_per_worker": 1,
      #
      "min_sample_timesteps_per_iteration": 1,
      "min_time_s_per_iteration": 0.001,
      ...
    }
    

    This means:

    • In one SAC iteration, one data collection (data_op) is carried out, ten train batches are created, and each of the ten train batches (train_ops) is used for one update of the networks.
    • In one SAC iteration, two samples are collected and stored in the buffer, and 2000 samples are used in the training phase.
    • Overall, when N samples come into the buffer, 1000*N samples are (randomly) drawn for training (this example sounds too intensive imo, though; worked arithmetic below).
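
    Plugging the config above into that reading of training_intensity (again, my assumption of how the numbers relate):

    num_total_samples = 1 * 1 * 2         # num_workers * num_envs_per_worker * rollout_fragment_length = 2
    natural_ratio = 200 / 2               # train_batch_size / num_total_samples = 100
    train_ops_per_data_op = 1000 / 100    # training_intensity / natural_ratio = 10
    steps_trained_per_iter = 10 * 200     # train_ops_per_data_op * train_batch_size = 2000
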
  4. If Q3 is correct, what happens when the ratio of training_intensity to “natural_ratio” is not an integer?

  5. Is it correct that min_sample_timesteps_per_iteration and min_time_s_per_iteration are set to 100 and 1, respectively, by default? Some examples use rollout_fragment_length=1 (with num_workers=1, num_envs_per_worker=1), which could silently allow the rollout worker to collect 100 or more samples in an iteration. Why are they set to these values in the default SAC config?

  6. In the training results, I think “num_env_steps_trained_this_iter” shows double what it should, given that Q3 is accurate. Is this just because we train two network types (i.e., the Q net(s) and the policy net)?
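
If it helps, this is the kind of minimal script I would use to inspect those counters (a sketch only; it assumes the dict-style config above is accepted by SAC in RLlib 2.1.0 and that both step counters are reported at the top level of the result dict):

    import ray
    from ray.rllib.algorithms.sac import SAC

    ray.init()

    config = {
        "env": "Pendulum-v1",
        "training_intensity": 1000,
        "train_batch_size": 200,
        "rollout_fragment_length": 2,
        "num_workers": 1,
        "num_envs_per_worker": 1,
        "min_sample_timesteps_per_iteration": 1,
        "min_time_s_per_iteration": 0.001,
    }

    algo = SAC(config=config)
    for _ in range(3):
        result = algo.train()
        # Compare sampled vs. trained environment-step counters for this iteration.
        print(
            "sampled_this_iter:", result.get("num_env_steps_sampled_this_iter"),
            "| trained_this_iter:", result.get("num_env_steps_trained_this_iter"),
        )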