Sanity check: I ran 2 training sessions (10K steps each); both SAC agents used the exact same config and seed, and here’s the reward curve (total reward vs. episode number)
This is what we have observed as well: the same workload that is perfectly deterministic on CPU is not deterministic on GPU.
I did some Google searches back then and concluded that, because of the parallel and asynchronous nature of GPU execution, it’s hard to make GPU training completely deterministic.
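The root cause is easy to demonstrate even without a GPU: floating-point addition is not associative, so whenever a parallel reduction combines partial sums in a different order (which GPU scheduling does not fix run to run), the bits of the result can change. A minimal sketch:

```python
# Floating-point addition is not associative, so the grouping order
# of a sum changes the result in the last bits.
a = (0.1 + 0.2) + 0.3   # left-to-right, like a single CPU thread
b = 0.1 + (0.2 + 0.3)   # a different grouping, like parallel partial sums

print(a)        # 0.6000000000000001
print(b)        # 0.6
print(a == b)   # False
```

Tiny per-step differences like this get amplified over thousands of gradient updates, which is why two "identical" GPU runs can end up with visibly different reward curves.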
Long story short, I don’t think you overlooked any configuration.
And please do share if you discover more about this topic.
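For what it’s worth, here is a sketch of the usual determinism knobs, assuming a PyTorch setup (the thread doesn’t say which framework is in use). Even with all of these set, determinism is only guaranteed for ops that have deterministic CUDA kernels:

```python
import os
import random

import numpy as np
import torch

# Required for deterministic cuBLAS GEMMs on CUDA >= 10.2;
# must be set before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


def seed_everything(seed: int = 0) -> None:
    """Seed all RNGs and force deterministic kernel selection."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # no autotuned kernel selection
    # Raises an error if an op has no deterministic implementation,
    # rather than silently running a non-deterministic one.
    torch.use_deterministic_algorithms(True)
```

Even then, bitwise-identical runs require identical hardware, driver, and library versions, so "same config and seed" alone is not enough on GPU.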