How severe does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
Hi,
I’ve been working with PPO in RLlib, but I’m pretty new to off-policy algorithms, particularly SAC.
I’d like to learn how the SAC algorithm collects data and processes training batches in RLlib’s implementation.
Below, I’ve summarized what I know so far and outlined some questions I have on the topic.
In my understanding,

SAC

- Let’s assume:
  - Single agent RL, single agent environment
  - RLlib’s implementation with the configuration settings
  - Configs:
    - `batch_mode`: "truncate_episodes"
    - `min_sample_timesteps_per_iteration` == 1
    - `min_time_s_per_iteration`: sufficiently small to only sample `rollout_fragment_length` samples in one rollout
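To be concrete, this is roughly the kind of config I have in mind for the assumptions above (the specific numbers are placeholders I made up, not recommended values):

```python
# Sketch of the assumed settings as an old-style config dict (placeholder values).
config = {
    "batch_mode": "truncate_episodes",
    "min_sample_timesteps_per_iteration": 1,
    "min_time_s_per_iteration": 0.001,   # "sufficiently small"
    "rollout_fragment_length": 2,        # placeholder
    "num_workers": 1,                    # placeholder
    "num_envs_per_worker": 1,            # placeholder
    "train_batch_size": 200,             # placeholder
}
```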
- Data collection
  - A rollout worker collects `rollout_fragment_length` samples in each of the `num_envs_per_worker` environments.
  - There are `num_(rollout_)workers` rollout workers in the data collection phase of an SAC iteration.
  - In total, the following number of samples is collected every iteration: `num_total_samples` = `num_workers` * `num_envs_per_worker` * `rollout_fragment_length`.
  - Put all the samples into the replay buffer. (Let’s skip the maintenance of the replay buffer when it is full, as that may be out of scope for this discussion.)
  - Let’s call this data collection phase a data_op (i.e., a data collection operation).
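Just to check that I read the formula right, here is the arithmetic I have in mind (plain Python with made-up values, nothing RLlib-specific):

```python
# Per-iteration sample count as I understand it (made-up values for illustration).
num_workers = 1
num_envs_per_worker = 1
rollout_fragment_length = 2

num_total_samples = num_workers * num_envs_per_worker * rollout_fragment_length
print(num_total_samples)  # -> 2 env steps added to the replay buffer per data_op
```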
- Train from batches
  - If `num_steps_sampled_before_learning_starts` or more samples have been collected, go on with the following. Otherwise, skip the rest of this phase.
  - Make a train batch: randomly sample `train_batch_size` (environment) steps of data from the buffer.
  - Train the networks using the batch.
  - Let’s name this ‘make_batch-and-train_nets’ sequence a train_op.
  - A train_op is repeated several times based on the `training_intensity` value.
  - Let’s define "natural_value" as `train_batch_size` / `num_total_samples` (used in the question section). I try to sketch this whole per-iteration flow in code right after this list.
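Putting the two phases together, this is my current mental model of one SAC iteration. It is only a sketch of my understanding, not RLlib’s actual `training_step()` code; the `workers`/`replay_buffer`/`learner` objects and the rounding of the train_op count are assumptions on my part:

```python
# My mental model of one SAC iteration (NOT RLlib source code; the objects and
# the rounding rule below are my own assumptions).
def sac_iteration(workers, replay_buffer, learner, total_steps_sampled, config):
    # --- data_op: each rollout worker collects rollout_fragment_length steps per env ---
    sample_batches = [w.sample() for w in workers]
    for batch in sample_batches:
        replay_buffer.add(batch)
    num_total_samples = sum(len(batch) for batch in sample_batches)
    total_steps_sampled += num_total_samples

    # --- train phase: only once enough samples have been collected overall ---
    if total_steps_sampled >= config["num_steps_sampled_before_learning_starts"]:
        natural_value = config["train_batch_size"] / num_total_samples
        # Assumption: number of train_ops per data_op = training_intensity / natural_value
        num_train_ops = int(round(config["training_intensity"] / natural_value))
        for _ in range(num_train_ops):
            train_batch = replay_buffer.sample(config["train_batch_size"])  # random sample
            learner.update(train_batch)  # one train_op: update Q net(s) and policy net

    return total_steps_sampled
```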
Questions
- Are the descriptions in Data collection accurate? If not, please help me get it right.
- Are the descriptions in Train from batches accurate? If not, I would appreciate any corrections or additional details.
- Does `training_intensity` always match the ratio of the number of train_ops to the number of data_ops carried out in an SAC iteration?

e.g.

```python
config = {
    "env": "Pendulum-v1",  # episode_length: 200 (fixed)
    "training_intensity": 1000,
    "train_batch_size": 200,
    "rollout_fragment_length": 2,  # note: RLlib 2.1.0; 'auto' not working..
    "num_workers": 1,
    "num_envs_per_worker": 1,
    "min_sample_timesteps_per_iteration": 1,
    "min_time_s_per_iteration": 0.001,
    ...
}
```
This means:
- In one SAC iteration, one data collection (data_op) is carried out, ten train batches are created, and each of the ten train batches is used in one update of the networks (train_op).
- In one SAC iteration, two samples are collected and stored in the buffer, and 2000 samples are used in the training phase.
- Overall, when N samples come into the buffer, 1000*N samples are (randomly) drawn for training. (This example sounds too intensive to me, though…)
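If my reading is right, the arithmetic behind those three bullets is the following (just my own arithmetic using the "natural_value" definition from above, not RLlib output):

```python
# Checking my numbers for the example config above.
train_batch_size = 200
rollout_fragment_length = 2
num_workers = 1
num_envs_per_worker = 1
training_intensity = 1000

num_total_samples = num_workers * num_envs_per_worker * rollout_fragment_length  # 2 per data_op
natural_value = train_batch_size / num_total_samples                             # 100.0
train_ops_per_data_op = training_intensity / natural_value                       # 10.0
env_steps_trained_per_iter = train_ops_per_data_op * train_batch_size            # 2000.0
```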
- If Q3 is correct, what happens when the ratio of `training_intensity` to "natural_value" becomes a non-integer? An example of what I mean is sketched below.
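For instance (the numbers here are made up purely to illustrate the question, using the same formulas as above):

```python
# A hypothetical non-integer case.
train_batch_size = 200
num_total_samples = 2
natural_value = train_batch_size / num_total_samples        # 100.0
training_intensity = 150
train_ops_per_data_op = training_intensity / natural_value  # 1.5 -> how is this handled?
```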
- `min_sample_timesteps_per_iteration` and `min_time_s_per_iteration` are set to 100 and 1, respectively, by default, right? But in some examples `rollout_fragment_length=1` (with `num_workers=1`, `num_envs_per_worker=1`), which could silently allow the rollout worker to collect 100 or more samples in an iteration. Why are they set to these values in the default SAC config? My reading of this interaction is sketched below.
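Here is how I currently picture that interaction. This is only my interpretation of the per-iteration stopping condition, not the actual RLlib iteration logic:

```python
# My interpretation of how the default min_sample_timesteps_per_iteration (100)
# interacts with a small rollout_fragment_length (illustration only).
min_sample_timesteps_per_iteration = 100
rollout_fragment_length = 1
num_workers = 1
num_envs_per_worker = 1

sampled_this_iter = 0
data_ops_this_iter = 0
while sampled_this_iter < min_sample_timesteps_per_iteration:
    # each data_op collects num_workers * num_envs_per_worker * rollout_fragment_length steps
    sampled_this_iter += num_workers * num_envs_per_worker * rollout_fragment_length
    data_ops_this_iter += 1

print(data_ops_this_iter)  # -> 100 data_ops before the iteration is allowed to end
```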
- In the training results, I think `num_env_steps_trained_this_iter` reports double the value I would expect, assuming Q3 is accurate. Is this just because we train two types of networks (i.e., Q net(s) and the policy net)?