I'm trying to run this example on a multi-node cluster: pbt_example (Ray v1.10.0)
It works fine on one machine but consistently fails when I use multiple nodes. I don't have rsync installed and am using AWS S3 for upload_dir.
Ray version: 1.10
sync_config = tune.SyncConfig(upload_dir="s3://mybucket/raytune/pbt/pbt_test/")
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=20,
    hyperparam_mutations={
        # distribution for resampling
        "lr": lambda: random.uniform(0.0001, 0.02),
        # allow perturbations within this set of categorical values
        "some_other_factor": [1, 2],
    })
analysis = tune.run(
    PBTBenchmarkExample,
    name="pbt_test",
    scheduler=pbt,
    sync_config=sync_config,
    local_dir="/opt/ml/model/checkpoints/",
    metric="mean_accuracy",
    mode="max",
    fail_fast=True,
    reuse_actors=True,
    checkpoint_freq=20,
    checkpoint_score_attr="mean_accuracy",
    stop={
        "training_iteration": 200,
    },
    num_samples=8,
    config=hpo_cfg,
)
(PBTBenchmarkExample pid=115, ip=100.x.x.111) 2022-03-01 22:57:32,416 INFO trainable.py:473 -- Restored on 100.71.29.111 from checkpoint: /opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00001_1_2022-03-01_22-57-21/checkpoint_000040/checkpoint
(PBTBenchmarkExample pid=115, ip=100.x.x.111) 2022-03-01 22:57:32,417 INFO trainable.py:480 -- Current state after restoring: {'_iteration': 40, '_timesteps_total': None, '_time_total': 0.0021026134490966797, '_episodes_total': None}
2022-03-01 22:57:32,608 ERROR trial_runner.py:1128 -- Trial PBTBenchmarkExample_f2eea_00006: Error processing restore.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 1121, in _process_trial_restore
    self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 707, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/worker.py", line 1733, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): ray::PBTBenchmarkExample.restore() (pid=536, ip=100.x.x.111, repr=<ray_pbt_tune2.PBTBenchmarkExample object at 0x7f5b1ca28690>)
  File "/usr/local/lib/python3.7/site-packages/ray/tune/trainable.py", line 453, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00006_6_2022-03-01_22-57-21/checkpoint_000040/checkpoint.tune_metadata'
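
In case it helps with debugging: the traceback shows restore() opening checkpoint_path + ".tune_metadata", so a small check like the one below could be run on each node to see which of the two files is actually present on that node's local disk. This is only a sketch. The helper name checkpoint_files_present is made up for illustration and is not part of Ray, and the path passed to it would be whichever checkpoint path appears in the error.

```python
import os


def checkpoint_files_present(checkpoint_path):
    """Report which checkpoint files exist on the local filesystem.

    Hypothetical helper (not a Ray API): checks the checkpoint file itself
    and the ".tune_metadata" companion that the traceback shows being opened.
    """
    metadata_path = checkpoint_path + ".tune_metadata"
    return {
        checkpoint_path: os.path.exists(checkpoint_path),
        metadata_path: os.path.exists(metadata_path),
    }


if __name__ == "__main__":
    # Path copied from the FileNotFoundError above; adjust per trial and node.
    path = (
        "/opt/ml/model/checkpoints/pbt_test29/"
        "PBTBenchmarkExample_f2eea_00006_6_2022-03-01_22-57-21/"
        "checkpoint_000040/checkpoint"
    )
    for name, present in checkpoint_files_present(path).items():
        print(("found   " if present else "missing ") + name)
```

If both files show as missing on the node that raised the error but are present on another node, that would at least be consistent with the checkpoint never having been copied across nodes.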