Porting over a conversation from a Slack channel.
Hi, I am trying to run Tune over a PyTorch distributed group with Population Based Training. Currently, I have Tune set up and running my custom training code. I start with:
analysis = tune.run(
    trainer,
    metric="reward",
    mode="max",
    name=conf["learn"]["experiment_name"] + "_tune",
    local_dir="./results",
    num_samples=conf["learn"]["tune_population_size"],
    # config=search_space,
    # search_alg=search_alg,
    scheduler=scheduler,
    resources_per_trial=None,
    sync_config=tune.SyncConfig(
        syncer=None  # disable syncing
    ),
    progress_reporter=reporter,
)
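(For reference, scheduler here is a PopulationBasedTraining scheduler. It is set up roughly like the sketch below; the mutation ranges are placeholders rather than my real search space, and the metric/mode are picked up from tune.run.)

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Rough sketch of the PBT scheduler passed to tune.run above.
# The mutation distributions/values are illustrative placeholders.
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=1,  # exploit/explore after every reported iteration
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-2),
        "operator_gamma": tune.uniform(0.9, 0.999),
        "operator_batch_size": [32, 64, 128],
    },
)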
Then, in trainer, I have
def tune_train(config, conf, evalOb, checkpoint_dir=None):
    trainer_type = conf["learn"]["trainer"].lower()
    conf["learn"]["gamma"][0] = config["operator_gamma"]
    conf["learn"]["training_batch_size"][0] = config["operator_batch_size"]
    conf["model"][f"{trainer_type}_hidden_sizes"] = [config["hidden1"], config["hidden2"]]
    conf["model"][f"{trainer_type}_dropout_rate"] = [min(config["dropout1"], 1), min(config["dropout2"], 1)]  # cap each dropout rate at 1
    conf["model"][f"{trainer_type}_activation"] = config["activation"]
    conf["learn"][f"{trainer_type}_lr"][0] = config["lr"]
    conf["learn"][f"{trainer_type}_lr_decay_rate"][0] = config["lr_decay_rate"]
    conf["learn"][f"{trainer_type}_lr_decay_freq"][0] = config["lr_decay_freq"]
    if "n_steps_update" in config:
        conf["learn"]["n_steps_update"] = config["n_steps_update"]
    if checkpoint_dir is not None:
        print(checkpoint_dir, "\n\n===================================", flush=True)
        conf["directories"]["checkpoint"] = os.path.join(checkpoint_dir, "checkpoint")
        conf["learn"]["load_checkpoint"] = True
    problem = define_problem(conf)
    trainer = define_trainer(conf, problem, evalOb)
    trainer.train()
which reads in a checkpoint.
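The trainer object passed to tune.run is this function wrapped with DistributedTrainableCreator from ray.tune.integration.torch, roughly like this (worker count and resources are placeholders):

from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator

# Rough sketch of how the distributed trainable is built (placeholder numbers).
trainer = DistributedTrainableCreator(
    tune.with_parameters(tune_train, conf=conf, evalOb=evalOb),
    num_workers=2,           # one Ray actor per torch.distributed worker
    backend="gloo",
    num_cpus_per_worker=1,
)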
Finally, in trainer.train, I call
    model_info = self._save_checkpoint(train_type)
    if ray.is_initialized():
        with distributed_checkpoint_dir(step=self.learning_steps_completed[0]) as checkpoint_dir:
            checkpoint_path = os.path.join(checkpoint_dir, f"{self.checkpoint_name}_{train_type}.pt")
            print("Checkpointing", checkpoint_path, flush=True)
            torch.save(model_info, checkpoint_path)
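On restore, the trainer's load_checkpoint (which appears in the traceback below) does roughly the following, simplified:

def load_checkpoint(self, train_type):
    # conf["directories"]["checkpoint"] was pointed at <checkpoint_dir>/checkpoint in tune_train
    checkpoint_path = os.path.join(
        self.conf["directories"]["checkpoint"],
        f"{self.checkpoint_name}_{train_type}.pt",  # same naming scheme as the save above (simplified)
    )
    model_info = torch.load(checkpoint_path)
    # ... restore model/optimizer state from model_info ...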
A trial is checkpointed at /WrappedDistributedTorchTrainable_600a8_00000_0_2022-02-22_13-07-00/worker_0/checkpoint_001000.
Then another trial receives a checkpoint_dir value so it can resume from that previous run. However, the checkpoint_dir it receives looks like /WrappedDistributedTorchTrainable_600a8_00000_0_2022-02-22_13-07-00/worker_0/checkpoint_tmp156ab9/./, where the main directory is correct but the specific checkpoint folder contains "tmp…", which was never part of any checkpoint that was actually created. I'm not sure what is happening and could use some advice.
From Xiaowei:
Hi, the short answer is that this is expected. The long answer:
Say we have a well-performing trial, trialA, that is checkpointed on machine A, and we want to start another trial, trialB, from trialA's checkpoint with a small mutation on machine B. What really happens is:
- trialA is checkpointed to a temporary file on machine A
- the temporary checkpoint is loaded into Ray's global object store, so it is accessible on machine B as well
- the object-store checkpoint is transformed back into a temporary file on machine B
- trialB restores from that temporary file on machine B (this is where you see that tmp folder)
Implementation details aside, are you running into issues?
I also want to mention that we recently introduced Ray Train for distributed training. You may want to give it a try; it also integrates seamlessly with Ray Tune.
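To make that flow concrete, here is a very rough illustration of the hand-off (not Ray's actual implementation; checkpoint_to_object and object_to_checkpoint are made-up names):

import os
import tempfile

import ray

def checkpoint_to_object(checkpoint_path):
    # Machine A: read the temporary checkpoint file and put its bytes into
    # Ray's global object store, which makes it visible cluster-wide.
    with open(checkpoint_path, "rb") as f:
        return ray.put(f.read())

def object_to_checkpoint(obj_ref):
    # Machine B: pull the bytes back out of the object store and write them
    # into a fresh temporary directory before the trial restores from it.
    # This is where directory names like "checkpoint_tmp..." come from.
    tmp_dir = tempfile.mkdtemp(prefix="checkpoint_tmp")
    restored_path = os.path.join(tmp_dir, "checkpoint.pt")
    with open(restored_path, "wb") as f:
        f.write(ray.get(obj_ref))
    return restored_path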
From Matthew:
Entire stack trace:
(WrappedDistributedTorchTrainable pid=54757) 2022-02-23 09:03:37,474 INFO trainable.py:472 -- Restored on 192.168.128.2 from checkpoint: /usr/WS1/landen2/drl/drl/results/ieee14_tune/WrappedDistributedTorchTrainable_208a4_00003_3_2022-02-23_09-00-39/tmpuxw0xslerestore_from_object/./
(WrappedDistributedTorchTrainable pid=54757) 2022-02-23 09:03:37,475 INFO trainable.py:480 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 133.75925588607788, '_episodes_total': None}
(func pid=54731) 2022-02-23 09:03:37,473 INFO trainable.py:472 -- Restored on 192.168.128.2 from checkpoint: /usr/WS1/landen2/drl/drl/results/ieee14_tune/WrappedDistributedTorchTrainable_208a4_00003_3_2022-02-23_09-00-39/worker_0/checkpoint_tmp2329bf/./
(func pid=54731) 2022-02-23 09:03:37,474 INFO trainable.py:480 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 133.75925588607788, '_episodes_total': None}
(func pid=54731) Traceback (most recent call last):
(func pid=54731) File "../drl/trainer/distributed_trainer.py", line 132, in train
(func pid=54731) self.do_train()
(func pid=54731) File "../drl/trainer/distributed_trainer.py", line 171, in do_train
(func pid=54731) self.load_checkpoint()
(func pid=54731) File "../drl/trainer/dqn_trainer.py", line 202, in load_checkpoint
(func pid=54731) model_info = torch.load(checkpoint_path)
(func pid=54731) File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
(func pid=54731) with _open_file_like(f, 'rb') as opened_file:
(func pid=54731) File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
(func pid=54731) return _open_file(name_or_buffer, mode)
(func pid=54731) File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
(func pid=54731) super(_open_file, self).__init__(open(name, mode))
(func pid=54731) FileNotFoundError: [Errno 2] No such file or directory: '/usr/WS1/landen2/drl/drl/results/ieee14_tune/WrappedDistributedTorchTrainable_208a4_00003_3_2022-02-23_09-00-39/worker_0/checkpoint_tmp2329bf/./checkpoint/ieee14_0.pt'
(func pid=54749) 2022-02-23 09:03:37,462 INFO trainable.py:472 -- Restored on 192.168.128.2 from checkpoint: /usr/WS1/landen2/drl/drl/results/ieee14_tune/WrappedDistributedTorchTrainable_208a4_00003_3_2022-02-23_09-00-39/worker_1/checkpoint_tmp848714/./
(func pid=54749) 2022-02-23 09:03:37,462 INFO trainable.py:480 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 133.75925588607788, '_episodes_total': None}
(func pid=54749) Traceback (most recent call last):
(func pid=54749) File "../drl/trainer/distributed_trainer.py", line 132, in train
(func pid=54749) self.do_train()
(func pid=54749) File "../drl/trainer/distributed_trainer.py", line 178, in do_train
(func pid=54749) self.make_groups()
(func pid=54749) File "../drl/trainer/distributed_trainer.py", line 128, in make_groups
(func pid=54749) self.groups.append(dist.new_group([0, i]))
(func pid=54749) File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2898, in new_group
(func pid=54749) pg = _new_process_group_helper(
(func pid=54749) File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 684, in _new_process_group_helper
(func pid=54749) pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
(func pid=54749) RuntimeError: Connection reset by peer
2022-02-23 09:03:38,689 ERROR trial_runner.py:927 -- Trial WrappedDistributedTorchTrainable_208a4_00003: Error processing event.
Traceback (most recent call last):
File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 893, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 707, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/ray/worker.py", line 1733, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::WrappedDistributedTorchTrainable.train() (pid=54757, ip=192.168.128.2, repr=<ray.tune.integration.torch.WrappedDistributedTorchTrainable object at 0x2ab4cb576480>)
File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/ray/tune/trainable.py", line 315, in train
result = self.step()
File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/ray/tune/integration/torch.py", line 120, in step
result = ray.get([w.step.remote() for w in self.workers])[0]
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.step() (pid=54731, ip=192.168.128.2, repr=func)
File "/usr/workspace/landen2/prefix/toss_3_x86_64_ib/conda-2020.11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 388, in step
raise TuneError(
ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.