OOM command not allowed when used memory > max memory (solved)

So I am assuming the issue is running out of memory; however, I do not believe (at least from my understanding) that my task should be consuming a large amount of memory. Screenshots of the error stack:

Code for the tunerTrain function:

Code for the actual train function:

The red underline shows the only place where I reference the image dataset (two datasets of 30k grayscale images, 28 × 28 pixels), with the data being passed using tune.with_parameters(train, data=[X_2, original]). For every iteration I create a small tensor (of 64 images) to train, validate, and then run the optimisation step. Rinse and repeat. (I do a tune.report every 10% of the total epochs.)
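The per-iteration batching looks roughly like this (a minimal NumPy-only sketch; the stand-in dataset and the `sample_batch` / `batch_amount` names are illustrative, not my exact code):

```python
import numpy as np

def sample_batch(data, batch_amount=64):
    # Draw batch_amount random image indices from the full dataset
    idx = np.random.randint(data.shape[0], size=batch_amount)
    return data[idx]  # a small (64, 28, 28) slice for this iteration

images = np.zeros((30000, 28, 28), dtype=np.float32)  # stand-in for X_2
batch = sample_batch(images)  # batch.shape == (64, 28, 28)
```

Each of these small slices is what gets turned into a tensor and trained on, so the per-iteration footprint should be tiny compared to the full dataset.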

So far, it looks like we’ve taken the right first steps for debugging:

  1. Made a minimal tuning script (only one concurrent hyperparameter evaluation at a time)
  2. Provided the traceback
  3. Provided context for your training function

My main questions now are:

  1. How is your training function defined?
  2. Can you post the output of:

import inspect
closure = inspect.getclosurevars(YOUR_TRAINING_FUNCTION)
print(closure)

ClosureVars(nonlocals={}, globals={'torch': <module 'torch' from 'C:\\Python38\\lib\\site-packages\\torch\\__init__.py'>, 'nn': <module 'torch.nn' from 'C:\\Python38\\lib\\site-packages\\torch\\nn\\__init__.py'>, 'autoencoder': <class '__main__.autoencoder'>, 'os': <module 'os' from 'C:\\Python38\\lib\\os.py'>, 'np': <module 'numpy' from 'C:\\Python38\\lib\\site-packages\\numpy\\__init__.py'>, 'random': <module 'random' from 'C:\\Python38\\lib\\random.py'>, 'shape': (tensor([[[-1.0000, -1.0000, -1.0000,  ..., -1.0000, -1.0000, -1.0000]]]), 3), 'tune': <module 'ray.tune' from 'C:\\Users\\denys\\AppData\\Roaming\\Python\\Python38\\site-packages\\ray\\tune\\__init__.py'>}, builtins={'range': <class 'range'>, 'float': <class 'float'>}, unbound={'optim', 'SGD', 'MSELoss', 'zero_grad', 'reshape', 'load_state_dict', 'parameters', 'save', 'unsqueeze', 'backward', 'load', 'from_numpy', 'randint', 'state_dict', 'step', 'path', 'report', 'item', 'checkpoint_dir', 'join'})

Is the second thing truncated? I can only see a bunch of floats, but I’m expecting something like:

ClosureVars(nonlocals={}, globals={'y': 0}, builtins={'print': <built-in function print>}, unbound=set())

(updated it for you)

It’s not, it was just wayyy off to the side. Here is a better paste of the tail end:

..., 3), 'tune': <module 'ray.tune' from 'C:\\Users\\denys\\AppData\\Roaming\\Python\\Python38\\site-packages\\ray\\tune\\__init__.py'>}, builtins={'range': <class 'range'>, 'float': <class 'float'>}, unbound={'optim', 'SGD', 'MSELoss', 'zero_grad', 'reshape', 'load_state_dict', 'parameters', 'save', 'unsqueeze', 'backward', 'load', 'from_numpy', 'randint', 'state_dict', 'step', 'path', 'report', 'item', 'checkpoint_dir', 'join'})

OK, it seems like there’s a massive shape tensor that’s being passed in. Can you try instead passing this in through with_parameters?

So if I understood correctly:

import inspect
closure = inspect.getclosurevars(tune.with_parameters(train, data=[X_2, original]))
print(closure)

In which case, there is still a big tensor:

C:\Python38\python.exe C:/Users/denys/Documents/GitHub/autoencoder/temp.py
done loading data
ClosureVars(nonlocals={'fn': <function train at 0x00000247EFB461F0>, 'kwargs': {'data': [array([[[ 3.14413160e-01,  2.66814739e-01,  ...,  1.71152905e-01]]], dtype=float32), array([[[-0.00392157, -0.00392157,  ..., -0.00392157]]], dtype=float32)]}, 'prefix': '<function train at 0x00000247EFB461F0>_', 'use_checkpoint': True}, globals={'inspect': <module 'inspect' from 'C:\\Python38\\lib\\inspect.py'>, 'parameter_registry': <ray.tune.registry._ParameterRegistry object at 0x00000247852D6F70>}, builtins={}, unbound={'parameters', 'default', 'signature', 'get'})

instead, can you do:

import inspect
closure = inspect.getclosurevars(train)
print(closure)

without the with_parameters?

Yup, I’ve done that initially, this is what it looks like:

ClosureVars(nonlocals={}, globals={'torch': <module 'torch' from 'C:\\Python38\\lib\\site-packages\\torch\\__init__.py'>, 'nn': <module 'torch.nn' from 'C:\\Python38\\lib\\site-packages\\torch\\nn\\__init__.py'>, 'autoencoder': <class '__main__.autoencoder'>, 'os': <module 'os' from 'C:\\Python38\\lib\\os.py'>, 'np': <module 'numpy' from 'C:\\Python38\\lib\\site-packages\\numpy\\__init__.py'>, 'random': <module 'random' from 'C:\\Python38\\lib\\random.py'>, 'shape': (tensor([[[-1.0000, -1.0000, -1.0000,  ..., -1.0000, -1.0000, -1.0000]]]), 3), 'tune': <module 'ray.tune' from 'C:\\Users\\denys\\AppData\\Roaming\\Python\\Python38\\site-packages\\ray\\tune\\__init__.py'>}, builtins={'range': <class 'range'>, 'float': <class 'float'>}, unbound={'optim', 'SGD', 'MSELoss', 'zero_grad', 'reshape', 'load_state_dict', 'parameters', 'save', 'unsqueeze', 'backward', 'load', 'from_numpy', 'randint', 'state_dict', 'step', 'path', 'report', 'item', 'checkpoint_dir', 'join'})

Ah, I mean: for some reason there’s a shape tensor within the ClosureVars. This implies that you did something like:

shape = torch.tensor(...)  # big tensor at module level

def train(config, ...):
    result = calculate(x, shape)  # captures the module-level tensor
    ...

This is bad because it causes data to be moved through Redis, which is not expected to hold large objects.

Instead, you should look into doing something like:

def train(config, shape=None, ...):
    result = calculate(x, shape)
    ...

tune.run(tune.with_parameters(train, shape=torch.tensor(...)))

If you do this correctly, you should see that inspect.getclosurevars(train) returns a ClosureVars that does not contain any numerical tensors.

Does that make sense?
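Concretely, you can see the difference with a tiny stand-alone example (the `big`, `train_bad`, and `train_good` names here are made up for illustration):

```python
import inspect
import numpy as np

big = np.ones((100, 28, 28), dtype=np.float32)  # module-level "dataset"

def train_bad(config):
    # References the module-level array, so it shows up in ClosureVars.globals
    return float(big.mean())

def train_good(config, data=None):
    # Only touches its own parameter; nothing large is captured
    return float(data.mean())

print('big' in inspect.getclosurevars(train_bad).globals)   # True
print('big' in inspect.getclosurevars(train_good).globals)  # False
```

That second pattern is exactly what with_parameters is for: the large object arrives as an argument instead of being captured alongside the function.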

Yes and no. I understand what you’re saying, but I’m not sure how it applies in my case.

def train(config, checkpoint_dir=None, data=None):

That’s my definition of train; inside, I never reference anything large outside of the function scope…

Pastebin of my train function: https://pastebin.com/yUb0BuK8

To clarify: the issue arises if I reference a big tensor inside the function that is not instantiated inside of it, right? If I create a tensor inside, it’s fine, but referencing an external one that was not passed with the parameters is bad?

Yep, that’s correct.
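For instance (a made-up sketch): a tensor created inside the function is a local variable and never appears in ClosureVars; only the modules you reference by name do:

```python
import inspect
import numpy as np

def train(config):
    # Created inside the function: a local variable, never captured
    batch = np.zeros((64, 28, 28), dtype=np.float32)
    return float(batch.sum())

cv = inspect.getclosurevars(train)
print(cv.globals)  # {'np': <module 'numpy' ...>} — just the module, no array
```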

Okay, I am at my wit’s end here… I never call any shape object inside my train function… The closest is idx = np.random.randint(data[0].shape[0], size=batchAmount). However, data is passed in via with_parameters as a numpy array, and .shape simply returns a tuple of (total number of images, 28, 28).

If you comment out model = autoencoder(config), does shape disappear from ClosureVars?

The way to debug here is to comment out different lines in train and see if the shape entry disappears from ClosureVars.

Note that this is also going to help solve this issue!

So with Richard Liaw’s help we narrowed the issue down to

tune.run(tune.with_parameters(train, data=[X_2, original]))

which was actually bugged: it was passing my whole dataset along with the function through Redis, causing a massive slowdown. Instead, we had to refactor that out and use ray.put and ray.get to pass my dataset. So if anyone has this issue at the moment:

TL;DR

tune.with_parameters is bugged, do not use it. Use ray.put and ray.get for dataset passing!

This should be fixed on the latest master (next release!)