Sanity check: I ran 2 training sessions (10K steps each); both SAC agents used the exact same config and seed, and here’s the reward curve (total reward vs. episode number)
This is what we have observed as well: the same workload that is perfectly deterministic on CPU is not deterministic on GPU.
I did some Google searches back then and concluded that, because of the parallel and asynchronous nature of GPU execution, it’s hard to make GPU training completely deterministic.
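The root cause is easy to demonstrate even without a GPU: floating-point addition is not associative, so whenever a parallel reduction combines partial sums in a different order (which GPU scheduling does not fix run to run), the bits of the result can change. A minimal sketch:

```python
# Floating-point addition is not associative, so the grouping order
# of a sum changes the result in the last bits.
a = (0.1 + 0.2) + 0.3   # left-to-right, like a single CPU thread
b = 0.1 + (0.2 + 0.3)   # a different grouping, like parallel partial sums

print(a)        # 0.6000000000000001
print(b)        # 0.6
print(a == b)   # False
```

Tiny per-step differences like this get amplified over thousands of gradient updates, which is why two "identical" GPU runs can end up with visibly different reward curves.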
Long story short, I don’t think you overlooked any configuration.
And please do share if you discover more about this topic.
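For what it’s worth, here is a sketch of the usual determinism knobs, assuming a PyTorch setup (the thread doesn’t say which framework is in use). Even with all of these set, determinism is only guaranteed for ops that have deterministic CUDA kernels:

```python
import os
import random

import numpy as np
import torch

# Required for deterministic cuBLAS GEMMs on CUDA >= 10.2;
# must be set before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


def seed_everything(seed: int = 0) -> None:
    """Seed all RNGs and force deterministic kernel selection."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # no autotuned kernel selection
    # Raises an error if an op has no deterministic implementation,
    # rather than silently running a non-deterministic one.
    torch.use_deterministic_algorithms(True)
```

Even then, bitwise-identical runs require identical hardware, driver, and library versions, so "same config and seed" alone is not enough on GPU.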