My team is using PPO with the default exploration settings and a stopping criterion based on the stability of the episode reward mean. When we increased the number of workers, convergence time decreased at first, but then it started increasing again.
Hey @Saurabh_Arora, great question. This kind of makes sense. PPO is a synchronous learner, meaning all workers' sample() calls are executed in parallel and collected before(!) the learning step happens on the concatenation of all the collected samples. What happens when you add more workers is that your rollout_fragment_length also gets adjusted accordingly to make sure the train batch size stays as you configured it. You can try increasing your train_batch_size parameter at the same time as you increase num_workers.
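A minimal sketch of what that could look like with the classic dict-style RLlib config and Ray Tune (the env, stopping value, and concrete numbers here are placeholders, not your actual setup):

```python
import ray
from ray import tune

ray.init()

num_workers = 8
rollout_fragment_length = 200  # assumed value, just for illustration

config = {
    "env": "CartPole-v1",  # placeholder env
    "num_workers": num_workers,
    "rollout_fragment_length": rollout_fragment_length,
    # Scale train_batch_size along with num_workers so that
    # rollout_fragment_length does not get shrunk to keep the batch size fixed:
    # train_batch_size ~= num_workers * num_envs_per_worker * rollout_fragment_length
    "train_batch_size": num_workers * rollout_fragment_length,
}

tune.run(
    "PPO",
    config=config,
    stop={"episode_reward_mean": 150},  # placeholder stopping criterion
)
```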
Another alternative would be to try an async algo, such as APPO or IMPALA, which will most likely scale better with the number of workers than PPO.
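Switching is mostly a matter of changing the trainable name; a hedged sketch, reusing the placeholder values from above (IMPALA is configured analogously via "IMPALA"):

```python
# Async variant: workers keep sampling while the learner updates,
# so adding workers is less likely to slow down the train step.
tune.run(
    "APPO",
    config={
        "env": "CartPole-v1",  # placeholder env
        "num_workers": num_workers,
        "rollout_fragment_length": rollout_fragment_length,
    },
    stop={"episode_reward_mean": 150},  # placeholder stopping criterion
)
```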