Redis connection resets

Hi,

I’ve been trying to set up a hyperparameter optimization with Ray Tune. It works fine with a small dummy dataset, but I get a ConnectionResetError (104) when I use the proper dataset. There were some old tickets on GitHub related to large objects in the object store, but just dumping large data structures in there manually works without problems. My code mostly follows the tutorials (a train_model function with the datasets as parameters):

    scheduler = ASHAScheduler(max_t=40,
                              grace_period=1,
                              reduction_factor=2)
    result = tune.run(tune.with_parameters(train_model, train_set=train_set, val_set=val_set),
                      resources_per_trial={"cpu": 2, "gpu": 0},
                      config=config,
                      metric="accuracy",
                      mode="max",
                      num_samples=1,
                      scheduler=scheduler,
                      verbose=2)

The exact stacktrace:

    File "…/…/bkiessli/hyperparam_opt.py", line 168, in <module>
      verbose=0)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/tune.py", line 321, in run
      restore=restore)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/experiment.py", line 138, in __init__
      self._run_identifier = Experiment.register_if_needed(run)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/experiment.py", line 276, in register_if_needed
      register_trainable(name, run_object)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/registry.py", line 71, in register_trainable
      _global_registry.register(TRAINABLE_CLASS, name, trainable)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/registry.py", line 124, in register
      self.flush_values()
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/tune/registry.py", line 146, in flush_values
      _internal_kv_put(_make_key(category, key), value, overwrite=True)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 27, in _internal_kv_put
      updated = worker.redis_client.hset(key, "value", value)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/client.py", line 3004, in hset
      return self.execute_command('HSET', name, key, value)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/client.py", line 877, in execute_command
      conn.send_command(*args)
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/connection.py", line 721, in send_command
      check_health=kwargs.get('check_health', True))
    File "/sps/humanum/eScriptorium/bkiessli/anaconda/envs/kraken/lib/python3.7/site-packages/redis/connection.py", line 713, in send_packed_command
      (errno, errmsg))
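
The traceback ends inside the Redis `hset` call made while the trainable is being registered, so one thing that may be worth checking is how large the datasets are once serialized. The following is only a rough sketch: it uses the standard-library `pickle`, whereas Ray serializes with its own cloudpickle-based machinery, so the numbers are an approximation. `small_dummy` and `larger_set` are hypothetical stand-ins for the real `train_set` and `val_set`, and whether the datasets actually end up in the registered value is an assumption, not something the traceback confirms.

```python
import pickle


def serialized_size_mb(obj):
    """Return the size of obj in MiB after pickling with the stdlib pickler."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(payload) / (1024 * 1024)


# Hypothetical stand-ins for the real train_set / val_set.
small_dummy = list(range(1_000))
larger_set = list(range(1_000_000))

for name, obj in [("small_dummy", small_dummy), ("larger_set", larger_set)]:
    print(f"{name}: {serialized_size_mb(obj):.2f} MiB")
```

If the proper dataset comes out far larger than the dummy one, that would at least be consistent with the error only appearing on the real data.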

What version of Ray are you on?
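
For reference, a minimal sketch of one way to report the installed versions. It assumes Python 3.8+ for `importlib.metadata`; on Python 3.7, as in the traceback above, `ray.__version__` gives the same information for Ray. The helper name `installed_version` is made up for this example.

```python
import sys
from importlib import metadata  # available from Python 3.8


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("python:", sys.version.split()[0])
for pkg in ("ray", "redis"):
    print(f"{pkg}:", installed_version(pkg) or "not installed")
```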

Can you try installing the latest nightly snapshot of master?

Sorry for the delayed answer; the cluster was shut down for maintenance. The latest nightly fixes it.