Based on your description, I think this makes sense.
When you have more than one worker, they all collect and report data samples in parallel. This introduces non-determinism in the ordering of samples in the training batch across separate runs.
When you only have 1 worker, it becomes deterministic. When the mini-batch size is the same as the training batch size, you are using all of the data for each gradient update, so across runs you are always updating with the same data. Even though the ordering of samples may differ, the gradient update is deterministic since that ordering does not matter in PPO.
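For concreteness, here is a minimal sketch of those two deterministic setups, assuming the older dict-style RLlib PPO config (the key names and batch sizes are illustrative and may differ across RLlib versions):

```python
# Sketch only: older dict-style RLlib config, placeholder values.
config = {
    "seed": 0,            # pin the other sources of randomness as well

    # Option A: a single rollout worker, so samples always arrive in the same order.
    "num_workers": 1,

    # Option B: keep multiple workers, but make each SGD minibatch span the whole
    # train batch; every update then uses all of the collected data, so sample
    # ordering no longer matters.
    # "num_workers": 4,
    # "train_batch_size": 4000,
    # "sgd_minibatch_size": 4000,
}
```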
Another option, which I don't think is currently implemented, would be to sort the training batch samples by worker index before training (rough sketch below).
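Something along these lines, assuming `worker_set` is the RLlib `WorkerSet` from the surrounding code, could work; this is a sketch rather than tested code, and depending on your RLlib version `SampleBatch.concat_samples` may instead be the module-level `concat_samples` helper:

```python
import ray
from ray.rllib.policy.sample_batch import SampleBatch

# Launch sampling on all workers in parallel and remember which worker owns each ref.
workers = worker_set.remote_workers()
ref_to_index = {worker.sample.remote(): i for i, worker in enumerate(workers)}

# Collect batches in whatever order the workers happen to finish, keeping the index.
tagged, pending = [], list(ref_to_index)
while pending:
    done, pending = ray.wait(pending, num_returns=1)
    tagged.append((ref_to_index[done[0]], ray.get(done[0])))

# Sort by worker index before concatenating so the train batch order is reproducible.
tagged.sort(key=lambda pair: pair[0])
train_batch = SampleBatch.concat_samples([batch for _, batch in tagged])
```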
Or retrieve samples from each worker in order. This would slow down sample throughput quite significantly if you have a lot of workers.
You would make that change here:
to something like:
```python
# Collect from each worker in turn so the batches always come back in worker order.
sample_batches = [ray.get(worker.sample.remote()) for worker in worker_set.remote_workers()]
```
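Because `ray.get` sits inside the comprehension, each worker's `sample.remote()` call is only issued after the previous worker's batch has been retrieved, so the workers effectively sample one after another rather than in parallel; that is where the throughput hit mentioned above comes from.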