How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
One of the big hopes with Ray-tune is that once it finds a good hparam config C with metric result X, one can use the config C to do normal training outside of ray-tune and (after ensuring identical seeds and data inputs) get an identical training run to what occurred inside the ray-tune trial. The actual final metric may differ somewhat from X because of differences in early termination and such, but the evolution of model weights should be identical.
However the issue I’m observing is quite serious:
Model weight initialization inside a ray process (or actor, or whatever the right term is) differs from initialization outside of ray-tune (say in local mode), and even a seemingly slight difference can lead to dramatically divergent model-weight evolution between the ray-tune trial and normal training. Moreover, this issue disappears when using
ray.init(local=True)
, i.e. all processes run locally. In other words, the problem is specifically discrepancy between a spawned Ray process vs normal Ray code running as the “main” process.
To be clear, I am not talking about reproducibility in the normal sense: I don’t really care if different runs give different results. However if a specific run gives a config C with result X, I expect to be able to achieve X with config C in normal training somewhere along the training path. Maybe this should be called replication of ray-tune results outside of ray-tune.
After repeatedly noticing differences between ray-tune runs and normal training runs, and several days isolating the issue, I have a tiny example highlighting the problem. (It’s quite possible there are other replication issues, but model initialization is at least one of them).
To see the issue, run this script with two settings of the 'local` variable:
-
local=True
: forces all runs to be local, and the printed model-weights-sum is identical inside and outside ray-tune. The fact that the problem does not occur in local mode probably makes it that much harder to pin down. -
local=False
: the “normal” ray-tune setting which spawns sub-processes for trials. This is where the issue occurs: the weights initialized within the subprocess are different from the weights initialized outside ray-tune.
Note that this is a clean minimal example showing differences in weight-initialization. In my actual application, this difference in initial model weights results in dramatically different training paths and metrics, so IMO this is a real problem.
I would be very interested to know:
- why does this happen?
- are there any good workarounds for this initialization issue?
- is it affected by any ray-related settings, e.g. about behavior of subprocesses etc?
# the lightning import below is the root cause of the problem --
# the import is not needed in the script, and if it is commented out,
# the problem disappears
from pytorch_lightning import seed_everything
import numpy as np
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from torch import nn
from torch.nn.init import xavier_uniform_,
from torch.nn.parameter import Parameter
import torch
class DumbModel(nn.Module):
'''We only care about issues involving initialization, so
there is no forward fn etc
'''
def __init__(self):
super().__init__()
self.wts = Parameter(torch.empty((3 * 128, 128)))
xavier_normal_(self.wts)
self.wts_sum = sum([x.abs().sum().item() for x in self.parameters()])
print(f'WTS = {self.wts_sum}')
def train(config={}):
mdl = DumbModel()
return mdl.wts_sum
def tune_hparams(config={}):
tune_scheduler = ASHAScheduler(
max_t=10,
grace_period=5,
reduction_factor=2,
)
def trainable(config):
seed = 0
torch.manual_seed(0)
np.random.seed(0)
train(config={})
# simulate training metric reports
for _ in range(10):
tune.report(obj = 3)
resources_per_trial = dict(cpu = 1, gpu=0)
if local:
# in this case there is no anomaly:
# model weights printed are same, inside tune.run and outside tune.run
ray.init(local_mode=True)
space = {
# immaterial dummy entry; nothing to tune !
'a': tune.grid_search([0])
}
tune.run(
trainable,
resources_per_trial=resources_per_trial,
metric="obj",
mode="max",
config = space,
num_samples=1,
scheduler=tune_scheduler,
)
return None
#--------------------------------------------------------
# First set this variable to True or False
local = False
# Step 1: Run Tune and see how model weights are initialized
# (see the model-weights-sum printed)
tune_hparams()
# Step 2: Do "normal" training outside of Tune with the supposed best config.
# In this case since we are focusing on non-determinism/non-reproducibility,
# there is no config to tune at all, except we first set the same seed as
# used inside the trainable, and now observe the printed model-weights-sum
# ensure we set same seed as inside the trainable
np.random.seed(0)
torch.manual_seed(0)
train()