Connection reset for large data, already using "tune.with_parameters"

Similar to the recent “Redis connection resets” thread, I’ve been unable to pass my large dataset to the models. I get the following traceback:

Traceback (most recent call last):
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/connection.py", line 700, in send_packed_command
    sendall(self._sock, item)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/_compat.py", line 8, in sendall
    return sock.sendall(*args, **kwargs)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hparam_search.py", line 151, in <module>
    train_infomax_asha(config, Dataset, ASHA)
  File "hparam_search.py", line 79, in train_infomax_asha
    analysis = tune.run(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/tune.py", line 299, in run
    experiments[i] = Experiment(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/experiment.py", line 138, in __init__
    self._run_identifier = Experiment.register_if_needed(run)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/experiment.py", line 276, in register_if_needed
    register_trainable(name, run_object)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/registry.py", line 71, in register_trainable
    _global_registry.register(TRAINABLE_CLASS, name, trainable)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/registry.py", line 124, in register
    self.flush_values()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/registry.py", line 146, in flush_values
    _internal_kv_put(_make_key(category, key), value, overwrite=True)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 27, in _internal_kv_put
    updated = worker.redis_client.hset(key, "value", value)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/client.py", line 3004, in hset
    return self.execute_command('HSET', name, key, value)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/client.py", line 877, in execute_command
    conn.send_command(*args)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/connection.py", line 720, in send_command
    self.send_packed_command(self.pack_command(*args),
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/connection.py", line 712, in send_packed_command
    raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.

My training function is as follows:

# Imports assumed from the top of hparam_search.py; Attention_Infomax, Dataset
# and basic_callbacks are project-specific.
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCallback

def train_model(config, data=None, checkpoint_dir=None):
    callback = TuneReportCallback({"loss": "avg_val_loss"}, on="validation_end")
    print(config)
    trainer = pl.Trainer(
        gpus=1,
        callbacks=[basic_callbacks(), callback],
        **config["Trainer kwargs"],
        auto_select_gpus=True,
        precision=16
    )
    # NB: this references the module-level Dataset global rather than the `data`
    # argument (see the resolution at the end of the thread).
    model = Attention_Infomax(config, Dataset)
    trainer.fit(model)

My tune.run call is as follows:

    analysis = tune.run(
        tune.with_parameters(train_model, data=Dataset),
        resources_per_trial=resources_per_trial,
        progress_reporter=reporter,
        scheduler=scheduler,
        config=config,
        raise_on_failed_trial=False,
        max_failures=0,
        num_samples=10,
        search_alg=search_algorithm,
        name="nevergrad",
        mode="min",
        metric="loss",
    )

Can you try the latest nightly wheels via pip install -U [wheel-link]?

Here are the links to the wheels – https://docs.ray.io/en/master/installation.html#daily-releases-nightlies

I used “ray install-nightly” prior to making this post. Was that equivalent? If so, it has not solved the problem; if not, I will make a new conda env and follow this instruction!

No, unfortunately “ray install-nightly” doesn’t work; we’ve added a note to the docs:

If you’re currently on ray<=1.0.1.post1, ray install-nightly will not install the most recent nightly wheels. Please use the links below instead.

I ran the command pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.1.0.dev0-cp38-cp38-manylinux2014_x86_64.whl and this doesn’t seem to have fixed it.

Here is the traceback:

Traceback (most recent call last):
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/connection.py", line 706, in send_packed_command
    sendall(self._sock, item)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/_compat.py", line 9, in sendall
    return sock.sendall(*args, **kwargs)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hparam_search.py", line 151, in <module>
    train_infomax_asha(config, Dataset, ASHA)
  File "hparam_search.py", line 79, in train_infomax_asha
    analysis = tune.run(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/tune.py", line 304, in run
    experiments[i] = Experiment(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/experiment.py", line 149, in __init__
    self._run_identifier = Experiment.register_if_needed(run)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/experiment.py", line 287, in register_if_needed
    register_trainable(name, run_object)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/registry.py", line 71, in register_trainable
    _global_registry.register(TRAINABLE_CLASS, name, trainable)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/registry.py", line 124, in register
    self.flush_values()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/tune/registry.py", line 146, in flush_values
    _internal_kv_put(_make_key(category, key), value, overwrite=True)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 27, in _internal_kv_put
    updated = worker.redis_client.hset(key, "value", value)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/client.py", line 900, in execute_command
    conn.send_command(*args)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/connection.py", line 725, in send_command
    self.send_packed_command(self.pack_command(*args),
  File "/home/michael/anaconda3/envs/pytorch-cuda-11-tune-nightly/lib/python3.8/site-packages/redis/connection.py", line 717, in send_packed_command
    raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

OK, got it. Can you try making the following change to your code:

    analysis = tune.run(
        tune.with_parameters(train_model, data=None),
        ...
    )

It won’t run properly, but it will tell us whether the data parameter is the culprit.
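
In the meantime, if you want to see directly what gets pickled when the trainable is registered, a rough check along these lines should work. It assumes train_model and Dataset are defined in hparam_search.py as in your snippets, and that cloudpickle is available (Ray vendors it, and the standalone package behaves the same):

import cloudpickle

# If train_model lives in the script you run and references a module-level
# global like Dataset, cloudpickle serializes that global along with the
# function, so this number ends up roughly the size of the dataset.
print("pickled train_model:", len(cloudpickle.dumps(train_model)), "bytes")

# For comparison, the dataset on its own:
print("pickled Dataset:", len(cloudpickle.dumps(Dataset)), "bytes")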

Weirdly enough, I get the same error with data=None, but it still only fails when the large dataset has been loaded in.

My full training function is below:

# Imports assumed from the top of hparam_search.py:
import ray
import nevergrad as ng
from multiprocessing import cpu_count
from ray import tune
from ray.tune import CLIReporter
from ray.tune.suggest.nevergrad import NevergradSearch

def train_infomax_asha(config, Dataset, scheduler):
    ray.init()
    reporter = CLIReporter(metric_columns=["loss", "training_iteration"])
    resources_per_trial = {"cpu": cpu_count(), "gpu": 1}

    search_algorithm = NevergradSearch(
        optimizer=ng.optimizers.OnePlusOne, mode="min", metric="loss"
    )

    analysis = tune.run(
        tune.with_parameters(train_model, data=None),  # data=None for the diagnostic suggested above
        resources_per_trial=resources_per_trial,
        progress_reporter=reporter,
        scheduler=scheduler,
        config=config,
        raise_on_failed_trial=False,
        max_failures=0,
        num_samples=10,
        search_alg=search_algorithm,
        name="nevergrad",
        mode="min",
        metric="loss",
    )
    print("Best hyperparameters found were: ", analysis.best_config)
    df = analysis.results_df
    df.to_csv("analysis_tune.csv")

It is working now; my mistake. The “inner training function” with data=None in it was referencing the global Dataset variable rather than the data argument, so the dataset was still being captured and pickled with the trainable. Large-dataset training now works.
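
For anyone who finds this later, here is roughly what the corrected function looks like. This is just a sketch using the names from my earlier snippet; it assumes Attention_Infomax takes the dataset as its second argument and that the dataset is passed in via tune.with_parameters(train_model, data=Dataset):

def train_model(config, data=None, checkpoint_dir=None):
    callback = TuneReportCallback({"loss": "avg_val_loss"}, on="validation_end")
    trainer = pl.Trainer(
        gpus=1,
        callbacks=[basic_callbacks(), callback],
        **config["Trainer kwargs"],
        auto_select_gpus=True,
        precision=16,
    )
    # Use the `data` argument injected by tune.with_parameters instead of the
    # module-level Dataset global, so nothing large is pickled with the trainable.
    model = Attention_Infomax(config, data)
    trainer.fit(model)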

I should clarify that the nightly release is still necessary!

OK, I guess we should also push a fix to help determine which items take up the most space during pickling :slight_smile:
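
Until something like that lands, a rough way to see which objects a trainable would drag along is to pickle each global the function references and sort by size. This is only a sketch: train_model is the function from above, __code__.co_names is just a heuristic for the names a function uses, and pickled_global_sizes is a hypothetical helper, not a Ray API:

import cloudpickle

def pickled_global_sizes(fn):
    """Return (name, pickled size in bytes) for module-level objects fn references."""
    sizes = []
    for name in fn.__code__.co_names:  # heuristic: global/attribute names used by fn
        obj = fn.__globals__.get(name)
        if obj is None:
            continue
        try:
            sizes.append((name, len(cloudpickle.dumps(obj))))
        except Exception:
            pass  # some objects (locks, open files, ...) won't pickle
    return sorted(sizes, key=lambda kv: kv[1], reverse=True)

for name, size in pickled_global_sizes(train_model):
    print(name, size, "bytes")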