Error in Colab: "ImplicitFunc is very large" and "grpc_status":8

Greetings to the community!!

I am trying to grid-search some parameters of my training function with Ray Tune.
The input data to train_cifar() used for training and testing are two lists of dimensions
400x13000 and 40x13000, respectively.

Because of the data size I cannot produce a reproducible example, but below I show three
different ways I have tried to tune my model.

In each case I receive the following error:

The actor ImplicitFunc is very large (95 MiB). Check that its definition is not implicitly
capturing a large array or other object in scope. Tip: use ray.put() to put large objects
in the Ray object store.

or this one:

debug_error_string = "{"created":"@1643300850.335447653","description":"Error received from peer ipv4:172.28.0.2:45437","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Received message larger than max (137418486 vs. 104857600)","grpc_status":8}"
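
For what it's worth, the two byte counts in the gRPC error convert to round MiB figures, so the cap being hit is exactly 100 MiB:

print(104857600 / 2**20)  # 100.0   -> the message cap is exactly 100 MiB
print(137418486 / 2**20)  # ~131.05 -> the serialized function exceeds it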

I don't understand how the serialized function reaches 95 MiB, since my lists are really small.

Any ideas about what I am doing wrong?

I am running the following code on Google Colab.

Kostas

CODE I

from functools import partial

import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

def train_cifar(config, data=None, checkpoint_dir=None):
    # The ObjectRefs arrive through the config; fetch the arrays from the store.
    X_scaled_train2 = ray.get(config["data1"])
    X_scaled_test2 = ray.get(config["data2"])

def tunerTrain():
    config = {
        "data1": X_scaled_train1,  # ObjectRefs created earlier with ray.put()
        "data2": X_scaled_test1,
    }
    scheduler = ASHAScheduler(
        ...
    )
    reporter = CLIReporter(
        ...
    )
    result = tune.run(
        partial(train_cifar),  # the data travels through config here
        ...
    )

tunerTrain()
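
To see what the warning is measuring, the pickled size of the trainable can be checked directly, including anything it captures from enclosing scope (cloudpickle ships with Ray, so this runs in the same Colab session; it is only a rough probe, not Ray's exact measurement):

import cloudpickle

# Size of the serialized trainable; captured globals count toward this.
size_mib = len(cloudpickle.dumps(train_cifar)) / 2**20
print(f"serialized train_cifar: {size_mib:.1f} MiB")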

CODE II

X_scaled_train = ...
X_scaled_test = ...

ray.init()
X_scaled_train1 = ray.put(X_scaled_train)  # store the big arrays once
X_scaled_test1 = ray.put(X_scaled_test)

def train_cifar(config, data=None, checkpoint_dir=None):
    # data = [train_ref, train_trait, test_ref, test_trait]
    X_scaled_train2 = ray.get(data[0])
    X_scaled_test2 = ray.get(data[2])

def tunerTrain():
    config = {
        ...
    }
    scheduler = ASHAScheduler(
        ...
    )
    reporter = CLIReporter(
        ...
    )
    result = tune.run(
        tune.with_parameters(train_cifar, data=[X_scaled_train1, X_scaled_train_trait,
                                                X_scaled_test1, X_scaled_test_trait]),
        ...
    )

tunerTrain()
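
If I read the docs right, tune.with_parameters already stores its keyword arguments in the object store, so the extra ray.put() in this version only means the trainable receives ObjectRefs and has to resolve them itself. A standalone sketch of that reference behavior:

import numpy as np
import ray

ray.init(ignore_reinit_error=True)

arr = np.zeros((400, 13000))  # stand-in for the real training list
ref = ray.put(arr)            # the big array lives in the object store
payload = [ref]               # the list of refs itself is tiny

# with_parameters would store `payload` itself; when the trainable runs it
# gets the list back, and each inner ref still needs an explicit get:
assert ray.get(payload[0]).shape == (400, 13000)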

CODE III

X_scaled_train = ...
X_scaled_test = ...

def train_cifar(config, data=None, checkpoint_dir=None):
    # data = [train_array, train_trait, test_array, test_trait]
    X_scaled_train2 = data[0]
    X_scaled_test2 = data[2]

def tunerTrain():
    config = {
        ...
    }
    scheduler = ASHAScheduler(
        ...
    )
    reporter = CLIReporter(
        ...
    )
    result = tune.run(
        tune.with_parameters(train_cifar, data=[X_scaled_train, X_scaled_train_trait,
                                                X_scaled_test, X_scaled_test_trait]),
        ...
    )

tunerTrain()
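
One crude check I can run on each version: print the global and attribute names the trainable references. co_names over-reports (it also includes attribute names), but any of my big arrays appearing in it would be captured when the function is pickled:

print(train_cifar.__code__.co_names)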

Hi there! I suspect there's a leak somewhere in your implementation. Can you try making sure that wherever you create or put the data, you do it inside tunerTrain()?

I suspect that you’re referencing X_scaled_train directly somewhere in train_cifar.
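
As far as I know, this bites in Colab because functions defined in a notebook live in __main__, where cloudpickle serializes referenced globals by value, so a stray mention of X_scaled_train inside train_cifar drags the whole array into the pickled actor. A minimal sketch of the fix I'm suggesting (load_data() is a hypothetical stand-in for however you build the arrays):

from ray import tune

def tunerTrain():
    # Build the data inside the function so it is passed explicitly and
    # never captured from module scope. load_data() is hypothetical.
    X_scaled_train, X_scaled_test = load_data()
    result = tune.run(
        tune.with_parameters(train_cifar, data=[X_scaled_train, X_scaled_test]),
        ...
    )

tunerTrain()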