Tuning fails with "The actor ImplicitFunc is too large"

I am trying to use Ray Tune to optimize the hyperparameters of a pytorch_tabular model:

from ray import tune

def train_tabular(config):
    model = build_model(
        num_trees=config['num_trees'], depth=config['depth'],
        num_layers=config['num_layers'], batch_size=config['batch_size'],
        use_embedding=True, epochs=10,
    )
    model.fit(train=df_train, validation=df_val)
    result = model.evaluate(df_val)
    tune.report(mse=result[0]['test_mean_squared_error'])


analysis = tune.run(
    train_tabular, config=config)

However, the tuning run fails almost immediately with an error complaining that an object is too large.

# Hyperparameter optimization...
2021-09-01 18:59:18,303	INFO services.py:1265 -- View the Ray dashboard at http://127.0.0.1:8265
2021-09-01 18:59:19,987	WARNING function_runner.py:559 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2021-09-01 18:59:25,981	WARNING tune.py:506 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
2021-09-01 18:59:34,362	ERROR ray_trial_executor.py:581 -- Trial train_tabular_b8c01_00000: Unexpected error starting runner.
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 571, in start_trial
    return self._start_trial(trial, checkpoint, train=train)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 450, in _start_trial
    runner = self._setup_remote_runner(trial)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 367, in _setup_remote_runner
    return full_actor_class.remote(**kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 488, in remote
    override_environment_variables))
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 366, in _invocation_actor_class_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 705, in _remote
    meta.method_meta.methods.keys())
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/function_manager.py", line 372, in export_actor_class
    self._worker)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/utils.py", line 635, in check_oversized_function
    raise ValueError(error)
ValueError: The actor ImplicitFunc is too large (493 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
2021-09-01 18:59:36,385	WARNING util.py:164 -- The `start_trial` operation took 6.931 s, which may be a performance bottleneck.

If anyone could suggest a reason for this error and how I might deal with it, please reply to this thread. The model fits on its own without error. I am using Ray 1.6 on AWS (Ubuntu).

Hey @fonnesbeck,

It could be due to the size of df_train or df_val: train_tabular references them directly, so they get serialized along with the training function when Tune creates the remote actor. To confirm this, you can check how large they are once pickled:

from ray import cloudpickle as pickle

# Serialized size of the DataFrame in MiB; repeat for df_val.
pickled = pickle.dumps(df_train)
length_mib = len(pickled) // (1024 * 1024)
print(length_mib)

If that is the case, you can keep the datasets out of the serialized function by passing them in through the tune.with_parameters API, which places them in the Ray object store instead.
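
A minimal sketch of what that could look like, assuming your existing build_model, df_train, and df_val (the data_train/data_val keyword names are just illustrative):

def train_tabular(config, data_train=None, data_val=None):
    # The datasets arrive as arguments instead of being captured from
    # the enclosing scope, so they are not serialized with the function.
    model = build_model(
        num_trees=config['num_trees'], depth=config['depth'],
        num_layers=config['num_layers'], batch_size=config['batch_size'],
        use_embedding=True, epochs=10,
    )
    model.fit(train=data_train, validation=data_val)
    result = model.evaluate(data_val)
    tune.report(mse=result[0]['test_mean_squared_error'])


analysis = tune.run(
    tune.with_parameters(train_tabular, data_train=df_train, data_val=df_val),
    config=config)

tune.with_parameters stores the passed objects in the object store once and hands them to each trial, so the actor definition itself stays small.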


Using with_parameters worked. Thanks!
