Given the code below:

from ray.rllib.agents.ppo import PPOTrainer  # legacy (pre-2.0) import path

config = {}  # assuming an otherwise-default PPO config
agent = PPOTrainer(config, env="CartPole-v1")  # (1)
for _ in range(1):
    result = agent.train()  # (2)
Does (1) mean that it has collected the training data of the default batch size of 4000 (i.e., called the reset and step functions of the environment)?
Does (2) mean that, with the data already collected from (1), the policy is being changed (trained) based on the calculated loss?
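To make the question concrete, here is how I tried to check this. I'm assuming the result-dict keys timesteps_total and episode_reward_mean that I've seen in RLlib examples, so treat this as a sketch rather than something confirmed by the docs:

result = agent.train()
# If each train() call collects a fresh sample batch, this should grow by
# roughly train_batch_size (default 4000) per call:
print(result.get("timesteps_total"))
# If the policy is actually being updated from the loss, this should
# improve over repeated calls:
print(result.get("episode_reward_mean"))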
I have seen so many versions of implementations of this and am getting confused. For example:
from ray.rllib.algorithms.ppo import PPOConfig

algo = (
    PPOConfig()
    .rollouts(num_rollout_workers=1)
    .resources(num_gpus=0)
    .environment(env="CartPole-v1")
    .build()
)
for i in range(10):
    result = algo.train()
Isn't this the same thing (apart from calling train() ten times instead of once)?
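From what I can tell, the PPOConfig builder is just the newer way of constructing the same algorithm. Here is what I believe is the equivalent legacy-style construction; the num_workers and num_gpus dict keys as counterparts of the builder calls are my assumption:

from ray.rllib.agents.ppo import PPOTrainer  # legacy import path

config = {
    "num_workers": 1,  # assumed counterpart of .rollouts(num_rollout_workers=1)
    "num_gpus": 0,     # assumed counterpart of .resources(num_gpus=0)
}
agent = PPOTrainer(config=config, env="CartPole-v1")
for i in range(10):
    result = agent.train()

If both versions run a collect-then-update cycle inside each train() call, I would expect them to behave the same.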