Hi,
I’ve been trying to set up hyperparameter optimization with Ray Tune. It works fine with a small dummy dataset, but I get a ConnectionResetError (104) when I use the proper dataset. There were some old tickets on GitHub related to large objects in the object store, but just dumping large data structures in there manually works without problems. My code mostly follows the tutorials (a train_model function with the datasets passed as parameters):
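from ray import tune
from ray.tune.schedulers import ASHAScheduler

# train_model, config, train_set and val_set are defined elsewhere in the script.
scheduler = ASHAScheduler(max_t=40,
                          grace_period=1,
                          reduction_factor=2)
result = tune.run(tune.with_parameters(train_model, train_set=train_set, val_set=val_set),
                  resources_per_trial={"cpu": 2, "gpu": 0},
                  config=config,
                  metric="accuracy",
                  mode="max",
                  num_samples=1,
                  scheduler=scheduler,
                  verbose=2)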
The exact stacktrace:
  File "…/…/bkiessli/hyperparam_opt.py", line 168, in <module>
    verbose=0)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/tune.py", line 321, in run
    restore=restore)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/experiment.py", line 138, in __init__
    self._run_identifier = Experiment.register_if_needed(run)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/experiment.py", line 276, in register_if_needed
    register_trainable(name, run_object)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/registry.py", line 71, in register_trainable
    _global_registry.register(TRAINABLE_CLASS, name, trainable)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/registry.py", line 124, in register
    self.flush_values()
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/registry.py", line 146, in flush_values
    _internal_kv_put(_make_key(category, key), value, overwrite=True)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 27, in _internal_kv_put
    updated = worker.redis_client.hset(key, "value", value)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/client.py", line 3004, in hset
    return self.execute_command('HSET', name, key, value)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/client.py", line 877, in execute_command
    conn.send_command(*args)
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/connection.py", line 721, in send_command
    check_health=kwargs.get('check_health', True))
  File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/connection.py", line 713, in send_packed_command
    (errno, errmsg))
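
In case it helps with diagnosis: the trace ends in a Redis HSET issued while the trainable is being registered, so the size of whatever gets serialized at that point might be relevant. Below is a minimal sketch for measuring pickled size using only the standard library. `payload_size_mb` is a hypothetical helper (not part of Ray), the lists are stand-ins for the real datasets, and plain `pickle` is only assumed to approximate what Ray actually serializes.

```python
import pickle


def payload_size_mb(obj):
    """Return the pickled size of obj in MiB (hypothetical helper, not a Ray API)."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(data) / (1024 ** 2)


if __name__ == "__main__":
    # Stand-ins for the dummy and the proper dataset; swap in train_set / val_set.
    dummy_small = list(range(1_000))
    dummy_large = list(range(1_000_000))
    for name, data in [("small", dummy_small), ("large", dummy_large)]:
        print(f"{name}: {payload_size_mb(data):.2f} MiB")
```

Comparing the numbers for the dummy and the proper dataset would at least show whether the failing case involves a much larger serialized payload.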