I am trying to use Ray Tune to optimize the hyperparameters of a pytorch_tabular
model:
from ray import tune

def train_tabular(config):
    # Build the model from the hyperparameters sampled for this trial.
    model = build_model(
        num_trees=config['num_trees'], depth=config['depth'],
        num_layers=config['num_layers'], batch_size=config['batch_size'],
        use_embedding=True, epochs=10)
    model.fit(train=df_train, validation=df_val)
    result = model.evaluate(df_val)
    # Report the validation MSE back to Tune.
    tune.report(mse=result[0]['test_mean_squared_error'])

analysis = tune.run(train_tabular, config=config)
However, the tuning run fails almost immediately with a ValueError saying that the actor ImplicitFunc is too large (493 MiB, above the 95 MiB threshold):
# Hyperparameter optimization...
2021-09-01 18:59:18,303 INFO services.py:1265 -- View the Ray dashboard at http://127.0.0.1:8265
2021-09-01 18:59:19,987 WARNING function_runner.py:559 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2021-09-01 18:59:25,981 WARNING tune.py:506 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
2021-09-01 18:59:34,362 ERROR ray_trial_executor.py:581 -- Trial train_tabular_b8c01_00000: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 571, in start_trial
return self._start_trial(trial, checkpoint, train=train)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 450, in _start_trial
runner = self._setup_remote_runner(trial)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 367, in _setup_remote_runner
return full_actor_class.remote(**kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 488, in remote
override_environment_variables))
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 366, in _invocation_actor_class_remote_span
return method(self, args, kwargs, *_args, **_kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 705, in _remote
meta.method_meta.methods.keys())
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/function_manager.py", line 372, in export_actor_class
self._worker)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/utils.py", line 635, in check_oversized_function
raise ValueError(error)
ValueError: The actor ImplicitFunc is too large (493 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
2021-09-01 18:59:36,385 WARNING util.py:164 -- The `start_trial` operation took 6.931 s, which may be a performance bottleneck.
Can anyone suggest what causes this error and how I might deal with it? The model fits on its own without error. I am using Ray 1.6 on AWS (Ubuntu).
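
The tip in the error message points at a likely cause: train_tabular reads df_train and df_val from the enclosing scope, so Ray serializes both DataFrames together with the function when it builds the trial actor. One way to avoid that, sketched below, is to pass the data through tune.with_parameters, which places the objects in the Ray object store and hands them to each trial as keyword arguments. This is a minimal sketch and not a tested fix for the setup above: fit_and_score is a hypothetical stand-in for the build_model / fit / evaluate calls, the small lists stand in for the real DataFrames, the search space is made up, and tune.report(mse=...) follows the Ray 1.x API used in the post.

```python
"""Sketch: hand large datasets to Ray Tune trials via tune.with_parameters."""


def fit_and_score(config, train, val):
    # Hypothetical stand-in for build_model(...).fit(...) / .evaluate(...):
    # "fits" by taking the training mean and returns the validation MSE.
    mean = sum(train) / len(train)
    return sum((y - mean) ** 2 for y in val) / len(val)


def train_tabular(config, df_train=None, df_val=None):
    # df_train and df_val arrive as arguments supplied by
    # tune.with_parameters instead of being captured from the enclosing
    # scope, so they are not pickled into the trainable itself.
    from ray import tune  # lazy import so fit_and_score runs without Ray

    tune.report(mse=fit_and_score(config, df_train, df_val))


def main():
    from ray import tune

    # Placeholders for the real (large) training and validation DataFrames.
    df_train = [1.0, 2.0, 3.0]
    df_val = [2.0, 4.0]

    # Hypothetical search space; the real one would list num_trees, depth, etc.
    config = {"num_trees": tune.choice([64, 128])}

    analysis = tune.run(
        tune.with_parameters(train_tabular, df_train=df_train, df_val=df_val),
        config=config,
    )
    print(analysis.get_best_config(metric="mse", mode="min"))


if __name__ == "__main__":
    main()
```

The only change to the trainable is its signature: the data becomes a pair of keyword arguments, and the function passed to tune.run is the wrapped version rather than train_tabular itself.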