Hey everyone, I am trying to use the PPO algorithm (available in ray[rllib]).
With num_workers = 4, I get reproducible results on a CPU machine (my local one). However, on a GPU machine using num_workers = 20, the results are not reproducible.
Can someone help with this?
Hi @Mohini,
Are you using tf or torch? If torch, your issue may be related to this bug I just filed this morning: [Bug] [RLLIB] Race condition in stats_fn when using multi-gpu · Issue #18812 · ray-project/ray · GitHub
Hey @mannyv,
Thanks a lot for the pointer. I am using tf in my current setup. Also, I am not using multiple GPUs (num_gpus = 0); only num_workers is used.
num_workers = 4 (local, CPU machine, gives reproducible results).
num_workers = 4 (GPU machine, doesn’t give reproducible results).
@Mohini OK well at least we can rule that out. Do you have a reproduction script available?
Hey @Mohini and @mannyv, very interesting topic.
Actually, we were looking into the same issue, which we think might be related to this code in rllib/utils/debug.py::update_global_seed_if_necessary(), which is used when you set the seed config key to some int value (not None):
# Torch.
if framework == "torch":
    torch, _ = try_import_torch()
    torch.manual_seed(seed)
    # See https://github.com/pytorch/pytorch/issues/47672.
    cuda_version = torch.version.cuda
    if cuda_version is not None and float(torch.version.cuda) >= 10.2:
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = "4096:8"
    else:
        from distutils.version import LooseVersion
        if LooseVersion(torch.__version__) >= LooseVersion("1.8.0"):
            # Not all Operations support this.
            torch.use_deterministic_algorithms(True)
        else:
            torch.set_deterministic(True)
    # This is only for Convolution no problem.
    torch.backends.cudnn.deterministic = True
So in the GPU case (CUDA >= 10.2), only the CUBLAS_WORKSPACE_CONFIG environment variable is set and we never call torch.use_deterministic_algorithms(True). Not sure whether this is correct.
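For comparison, here is a rough sketch of what doing both steps together could look like: setting the cuBLAS workspace variable and also requesting deterministic algorithms, regardless of whether CUDA is present. The helper name seed_torch_deterministically is made up for illustration and is not part of RLlib, and this is not a confirmed fix for the behaviour above. It only applies to the torch framework, so it would not explain the tf case. Note that the PyTorch reproducibility notes write the variable with a leading colon (":4096:8"), which is what the sketch uses.

```python
import os
import random


def seed_torch_deterministically(seed: int) -> bool:
    """Hypothetical helper (not an RLlib API): seed RNGs and request determinism.

    Returns True if torch could be imported and configured, False otherwise.
    """
    # Set the cuBLAS workspace config up front, before any CUDA work is done.
    # Value format as written in the PyTorch reproducibility notes.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    try:
        import torch
    except ImportError:
        torch = None

    # Seed Python's own RNG in every case.
    random.seed(seed)
    if torch is None:
        return False

    # Seed torch's RNG.
    torch.manual_seed(seed)

    # Request deterministic algorithms on CPU and GPU alike, instead of
    # only in the non-CUDA branch.
    if hasattr(torch, "use_deterministic_algorithms"):  # torch >= 1.8
        torch.use_deterministic_algorithms(True)
    elif hasattr(torch, "set_deterministic"):  # older torch releases
        torch.set_deterministic(True)

    # cuDNN settings only affect convolution algorithm selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    return True


if __name__ == "__main__":
    configured = seed_torch_deterministically(42)
    print("torch configured:", configured)
```

As the comment in the RLlib snippet says, not all operations support deterministic mode, so a model using such an operation may raise a RuntimeError once this is enabled.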