Model initialization is different inside vs outside Ray-Tune

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

One of the big hopes with Ray-tune is that once it finds a good hparam config C with metric result X, one can use config C to do normal training outside of ray-tune and (after ensuring identical seeds and data inputs) get a training run identical to the one that occurred inside the ray-tune trial. The actual final metric may differ somewhat from X because of differences in early termination and such, but the evolution of the model weights should be identical.

However the issue I’m observing is quite serious:

Model weight initialization inside a Ray process (or actor, or whatever the right term is) differs from initialization outside of ray-tune (say in local mode), and even a seemingly slight difference can lead to dramatically divergent model-weight evolution between the ray-tune trial and normal training. Moreover, the issue disappears when using ray.init(local_mode=True), i.e. when all processes run locally. In other words, the problem is specifically a discrepancy between a spawned Ray worker process and normal Ray code running in the “main” process.

To be clear, I am not talking about reproducibility in the normal sense: I don’t really care if different runs give different results. However if a specific run gives a config C with result X, I expect to be able to achieve X with config C in normal training somewhere along the training path. Maybe this should be called replication of ray-tune results outside of ray-tune.
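
Concretely, the replication workflow I have in mind looks roughly like this (just a sketch; my_trainable, my_train_fn, search_space and the "obj" metric are placeholders for my real code, and get_best_config is the accessor on the analysis object returned by tune.run):

import numpy as np
import torch
from ray import tune

# Step A: search, with a fixed seed set inside the trainable
analysis = tune.run(my_trainable, config=search_space, metric="obj", mode="max")
best_config = analysis.get_best_config(metric="obj", mode="max")

# Step B: replicate outside of ray-tune -- same seed, same data, same config.
# The expectation is that the weight trajectory matches the tune trial.
torch.manual_seed(0)
np.random.seed(0)
my_train_fn(best_config)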

After repeatedly noticing differences between ray-tune runs and normal training runs, and several days isolating the issue, I have a tiny example highlighting the problem. (It’s quite possible there are other replication issues, but model initialization is at least one of them).

To see the issue, run this script with two settings of the `local` variable:

  • local=True: forces all runs to be local, and the printed model-weights-sum is identical inside and outside ray-tune. The fact that the problem does not occur in local mode probably makes it that much harder to pin down.

  • local=False: the “normal” ray-tune setting which spawns sub-processes for trials. This is where the issue occurs: the weights initialized within the subprocess are different from the weights initialized outside ray-tune.

Note that this is a clean minimal example showing differences in weight-initialization. In my actual application, this difference in initial model weights results in dramatically different training paths and metrics, so IMO this is a real problem.

I would be very interested to know:

  • why does this happen?
  • are there any good workarounds for this initialization issue?
  • is it affected by any ray-related settings, e.g. ones controlling the behavior of subprocesses?
# the lightning import below is the root cause of the problem --
# the import is not needed in the script, and if it is commented out,
# the problem disappears

from pytorch_lightning import seed_everything


import numpy as np
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from torch import nn
from torch.nn.init import xavier_normal_
from torch.nn.parameter import Parameter
import torch


class DumbModel(nn.Module):
    '''We only care about issues involving initialization, so
    there is no forward fn etc
    '''
    def __init__(self):
        super().__init__()
        self.wts = Parameter(torch.empty((3 * 128, 128)))
        xavier_normal_(self.wts)
        self.wts_sum = sum([x.abs().sum().item() for x in self.parameters()])
        print(f'WTS = {self.wts_sum}')

def train(config={}):
    mdl = DumbModel()
    return mdl.wts_sum

def tune_hparams(config={}):

    tune_scheduler = ASHAScheduler(
        max_t=10,
        grace_period=5,
        reduction_factor=2,
    )

    def trainable(config):
        seed = 0
        torch.manual_seed(seed)
        np.random.seed(seed)
        train(config={})
        # simulate training metric reports
        for _ in range(10):
            tune.report(obj=3)

    resources_per_trial = dict(cpu = 1, gpu=0)

    if local:
        # in this case there is no anomaly: the printed model-weights-sum
        # is the same inside and outside tune.run
        ray.init(local_mode=True)

    space = {
        # immaterial dummy entry; nothing to tune !
        'a': tune.grid_search([0])
    }

    tune.run(
        trainable,
        resources_per_trial=resources_per_trial,
        metric="obj",
        mode="max",
        config = space,
        num_samples=1,
        scheduler=tune_scheduler,
    )
    return None

#--------------------------------------------------------

# First set this variable to True or False
local = False

# Step 1: Run Tune and see how model weights are initialized
# (see the model-weights-sum printed)

tune_hparams()


# Step 2: Do "normal" training outside of Tune with the supposed best config.
# Since we are focusing on replication rather than tuning, there is nothing to
# configure; we simply rerun train() and observe the printed model-weights-sum.

# ensure we set same seed as inside the trainable
np.random.seed(0)
torch.manual_seed(0)

train()

It turns out that the root cause of the problem is the PyTorch Lightning import of seed_everything. I’ve also simplified the code.

The script does not actually use seed_everything. In my real code I do use it, but while creating the minimal example I had left the import in, and noticed that removing it makes the problem go away.

So it looks like importing seed_everything results in different behavior between Ray running with ray.init(local_mode=True) and Ray running in its default (non-local) mode.
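
One way to check whether the import itself touches global torch state (just a diagnostic sketch; I have not confirmed this is the mechanism) is to fingerprint the CPU RNG state and a couple of global flags before and after the import:

import hashlib

import torch

def torch_state_fingerprint():
    # hash of the global CPU RNG state plus a few flags that can silently
    # change numerics
    rng_hash = hashlib.sha1(torch.get_rng_state().numpy().tobytes()).hexdigest()[:12]
    return {
        "rng": rng_hash,
        "default_dtype": torch.get_default_dtype(),
        "cudnn_benchmark": torch.backends.cudnn.benchmark,
        "cudnn_deterministic": torch.backends.cudnn.deterministic,
    }

torch.manual_seed(0)
before = torch_state_fingerprint()
from pytorch_lightning import seed_everything  # noqa: F401,E402
after = torch_state_fingerprint()
print(before)
print(after)  # any difference points to import-time side effects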

I can’t reproduce the issue you’re seeing with the script. I made a slight modification to show the in-trainable results:

        res = train(config={})
        # simulate training metric reports
        tune.report(obj=res)

And with both local=True and local=False I get the same results:

+-----------------------+------------+----------------+-----+--------+------------------+---------+
| Trial name            | status     | loc            |   a |   iter |   total time (s) |     obj |
|-----------------------+------------+----------------+-----+--------+------------------+---------|
| trainable_a9412_00000 | TERMINATED | 127.0.0.1:7826 |   0 |      1 |       0.00226784 | 2451.01 |
+-----------------------+------------+----------------+-----+--------+------------------+---------+


2022-03-28 15:52:17,322	INFO tune.py:703 -- Total run time: 0.43 seconds (0.24 seconds for the tuning loop).
WTS = 2451.010009765625

regardless of whether I import seed_everything or not.

What is the result you’re observing?

One thing that should be noted is that the Ray Tune control loop uses random number generators, e.g. to sample new configurations or to shuffle futures around to balance results processing. In local mode it is likely that the randomness of the different trials and of the control loop interact with each other and lead to different results. But that depends on the current implementation of local mode (which I’m not familiar with).
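
For what it’s worth, one way to make the weight draw independent of whatever happens to the global RNG (whether from the control loop or from imports) is to use a private torch.Generator. A sketch, with the Xavier math written out by hand so the draw can take an explicit generator:

import math

import torch

def xavier_normal_private(tensor, gain=1.0, seed=0):
    # Same math as torch.nn.init.xavier_normal_ for a 2-D weight, but
    # drawing from a private Generator so that ambient use of the global
    # RNG cannot shift the initialization.
    fan_out, fan_in = tensor.shape
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    g = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        return tensor.normal_(0.0, std, generator=g)

w = torch.empty(3 * 128, 128)
xavier_normal_private(w)
print(w.abs().sum().item())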

Ray Tune is not well tested in local mode, btw.

Thanks for looking into this, @kai

Grepping the output of the script for WTS with local=True, I get this output (the first line is from inside the trainable, the second from outside):

WTS = 2451.01025390625
WTS = 2451.01025390625

and with local=False:

(trainable pid=4140870) WTS = 2451.010009765625
WTS = 2451.01025390625

So there is a difference in the fourth decimal place. Seemingly small, but it leads to large divergence as training progresses. After looking at numerous outputs across various runs, I know these sums should match to every decimal shown. The above is with the seed_everything import at the top, of course.
Interestingly, if the import is moved to the bottom, the issue disappears again…
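
Based on that observation, one workaround worth trying (a sketch only, under the assumption that the import’s placement relative to seeding is what matters) is to do the import and the seeding at the same point in both code paths, reusing train() from the script above:

def seeded_train(config, seed=0):
    # Do the pytorch_lightning import and the seeding in one place, so any
    # import-time side effects land identically inside the Tune worker and
    # in the standalone run (assumption, not verified).
    from pytorch_lightning import seed_everything
    seed_everything(seed)  # seeds python, numpy and torch
    return train(config)   # train() from the script above

# inside the trainable:   seeded_train(config)
# outside of Tune:        seeded_train(best_config)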