[PBT] No such file or directory

I am trying to run this example on a multi-node cluster: pbt_example — Ray v1.10.0
It works fine on one machine but consistently fails when I use multiple nodes. I don't have rsync installed and am using AWS S3 for the upload_dir.
Ray version: 1.10

sync_config = tune.SyncConfig(upload_dir="s3://mybucket/raytune/pbt/pbt_test/")

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=20,
    hyperparam_mutations={
        # distribution for resampling
        "lr": lambda: random.uniform(0.0001, 0.02),
        # allow perturbations within this set of categorical values
        "some_other_factor": [1, 2],
    })

analysis = tune.run(
    PBTBenchmarkExample,
    name="pbt_test",
    scheduler=pbt,
    sync_config=sync_config,
    local_dir="/opt/ml/model/checkpoints/",
    metric="mean_accuracy",
    mode="max",
    fail_fast=True,
    reuse_actors=True,
    checkpoint_freq=20,
    checkpoint_score_attr="mean_accuracy",
    stop={
        "training_iteration": 200,
    },
    num_samples=8,
    config=hpo_cfg,
)
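
Since rsync is not installed, everything here depends on the S3 upload_dir syncing correctly from every node. As far as I know, the default cloud syncer in this Ray version shells out to the AWS CLI, so one quick sanity check is that the aws binary is on PATH on every node. A minimal sketch (check_aws_cli is just an illustrative helper; it assumes the cluster is already up):

import shutil
import socket

import ray

ray.init(address="auto")


@ray.remote(num_cpus=0)
def check_aws_cli():
    # Report this node's hostname and the path to the `aws` binary (None if missing).
    return socket.gethostname(), shutil.which("aws")


# Pin one task to each alive node via the built-in "node:<ip>" resource.
results = ray.get([
    check_aws_cli.options(
        resources={f"node:{node['NodeManagerAddress']}": 0.01}
    ).remote()
    for node in ray.nodes()
    if node["Alive"]
])
print(results)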

(PBTBenchmarkExample pid=115, ip=100.x.x.111) 2022-03-01 22:57:32,416	INFO trainable.py:473 -- Restored on 100.71.29.111 from checkpoint: /opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00001_1_2022-03-01_22-57-21/checkpoint_000040/checkpoint
(PBTBenchmarkExample pid=115, ip=100.x.x.111) 2022-03-01 22:57:32,417	INFO trainable.py:480 -- Current state after restoring: {'_iteration': 40, '_timesteps_total': None, '_time_total': 0.0021026134490966797, '_episodes_total': None}

2022-03-01 22:57:32,608	ERROR trial_runner.py:1128 -- Trial PBTBenchmarkExample_f2eea_00006: Error processing restore.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 1121, in _process_trial_restore
    self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 707, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/worker.py", line 1733, in get
    raise value.as_instanceof_cause()

ray.exceptions.RayTaskError(FileNotFoundError): ray::PBTBenchmarkExample.restore() (pid=536, ip=100.x.x.111, repr=<ray_pbt_tune2.PBTBenchmarkExample object at 0x7f5b1ca28690>)
  File "/usr/local/lib/python3.7/site-packages/ray/tune/trainable.py", line 453, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00006_6_2022-03-01_22-57-21/checkpoint_000040/checkpoint.tune_metadata'

Can anybody have a look at this issue, please?

Taking a look. Also @kai

@xwjiang2010 did you find anything?

Ah sorry, got distracted by something else. I will take a look today!

Hi, I took a look but couldn't repro. Here is my script. I updated it to use SyncConfig(upload_dir="s3://").

#!/usr/bin/env python

import numpy as np
import argparse
import random

import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining


class PBTBenchmarkExample(tune.Trainable):
    """Toy PBT problem for benchmarking adaptive learning rate.

    The goal is to optimize this trainable's accuracy. The accuracy increases
    fastest at the optimal lr, which is a function of the current accuracy.

    The optimal lr schedule for this problem is the triangle wave as follows.
    Note that many lr schedules for real models also follow this shape:

     best lr
      ^
      |    /\
      |   /  \
      |  /    \
      | /      \
      ------------> accuracy

    In this problem, using PBT with a population of 2-4 is sufficient to
    roughly approximate this lr schedule. Higher population sizes will yield
    faster convergence. Training will not converge without PBT.
    """

    def setup(self, config):
        self.lr = config["lr"]
        self.accuracy = 0.0  # end = 1000

    def step(self):
        midpoint = 100  # lr starts decreasing after acc > midpoint
        q_tolerance = 3  # penalize exceeding lr by more than this multiple
        noise_level = 2  # add gaussian noise to the acc increase
        # triangle wave:
        #  - start at 0.001 @ t=0,
        #  - peak at 0.01 @ t=midpoint,
        #  - end at 0.001 @ t=midpoint * 2,
        if self.accuracy < midpoint:
            optimal_lr = 0.01 * self.accuracy / midpoint
        else:
            optimal_lr = 0.01 - 0.01 * (self.accuracy - midpoint) / midpoint
        optimal_lr = min(0.01, max(0.001, optimal_lr))

        # compute accuracy increase
        q_err = max(self.lr, optimal_lr) / min(self.lr, optimal_lr)
        if q_err < q_tolerance:
            self.accuracy += (1.0 / q_err) * random.random()
        elif self.lr > optimal_lr:
            self.accuracy -= (q_err - q_tolerance) * random.random()
        self.accuracy += noise_level * np.random.normal()
        self.accuracy = max(0, self.accuracy)

        return {
            "mean_accuracy": self.accuracy,
            "cur_lr": self.lr,
            "optimal_lr": optimal_lr,  # for debugging
            "q_err": q_err,  # for debugging
            "done": self.accuracy > midpoint * 2,
        }

    def save_checkpoint(self, checkpoint_dir):
        return {
            "accuracy": self.accuracy,
            "lr": self.lr,
        }

    def load_checkpoint(self, checkpoint):
        self.accuracy = checkpoint["accuracy"]

    def reset_config(self, new_config):
        self.lr = new_config["lr"]
        self.config = new_config
        return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    parser.add_argument(
        "--cluster",
        action="store_true",
        help="Distribute tuning on a cluster")
    parser.add_argument(
        "--server-address",
        type=str,
        default=None,
        required=False,
        help="The address of server to connect to if using "
        "Ray Client.")
    args, _ = parser.parse_known_args()

    if args.server_address:
        ray.init(f"ray://{args.server_address}")
    elif args.cluster:
        ray.init(address="auto")
    elif args.smoke_test:
        ray.init(num_cpus=2)  # force pausing to happen for test
    else:
        ray.init()

    pbt = PopulationBasedTraining(
        time_attr="training_iteration",
        perturbation_interval=20,
        hyperparam_mutations={
            # distribution for resampling
            "lr": lambda: random.uniform(0.0001, 0.02),
            # allow perturbations within this set of categorical values
            "some_other_factor": [1, 2],
        })

    analysis = tune.run(
        PBTBenchmarkExample,
        name="pbt_test",
        scheduler=pbt,
        metric="mean_accuracy",
        mode="max",
        reuse_actors=True,
        checkpoint_freq=20,
        verbose=False,
        stop={
            "training_iteration": 200,
        },
        num_samples=40,
        config={
            "lr": 0.0001,
            # note: this parameter is perturbed but has no effect on
            # the model training in this example
            "some_other_factor": 1,
        },
        sync_config=tune.SyncConfig("s3://data-test-ilr/durable_upload"))

    print("Best hyperparameters found were: ", analysis.best_config)

Some sample output from my run:

(PBTBenchmarkExample pid=4094, ip=172.31.95.222) 2022-03-14 22:27:16,900	INFO trainable.py:535 -- Restored on 172.31.95.222 from checkpoint: /home/ray/ray_results/pbt_test/PBTBenchmarkExample_203f8_00000_0_2022-03-14_22-24-04/tmpmplvxj3erestore_from_object/checkpoint
(PBTBenchmarkExample pid=4094, ip=172.31.95.222) 2022-03-14 22:27:16,900	INFO trainable.py:543 -- Current state after restoring: {'_iteration': 140, '_timesteps_total': None, '_time_total': 0.006560087203979492, '_episodes_total': None}

@taqreez
In your run, does every restore fail to find the checkpoint, or only some of them? I am trying to see if it's consistent or random.
Also @kai, one interesting thing: I think we restore from a temporary checkpoint in this case, but in @taqreez's case there is no tmp_xyz in the checkpoint path in the error message. Does this ring a bell to you?
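
If it helps narrow this down, a quick check (just a sketch, using the local_dir and trial name from the error message) would be to list what actually exists under the failing trial's directory on that node. That would tell us whether the whole checkpoint_000040 directory is missing there or only the .tune_metadata file:

import glob

# Run this on the node that raised the FileNotFoundError (ip=100.x.x.111).
pattern = "/opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00006*/checkpoint_*/*"
for path in sorted(glob.glob(pattern)):
    print(path)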

Hi @taqreez, can you share more logs with us? E.g. the exploit messages would be helpful; a full run log would be best. Also, what does the directory structure look like inside your bucket for that experiment?
Can you share your cluster setup: how many nodes, and which resources are you using? A cluster config would be best so we can reproduce easily. If you provide us with something we can “just run” (aside from the S3 bucket etc.), it makes it much easier for us to help you.
Lastly, can you try running again with Ray 1.11.0?
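
For the bucket layout, something like this would give us the object keys for the experiment (a sketch assuming boto3 is installed and AWS credentials are configured; bucket and prefix taken from the SyncConfig in the original post):

import boto3

# List everything that was uploaded under the experiment prefix on S3.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket", Prefix="raytune/pbt/pbt_test/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])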