[PBT] No such file or directory

I am trying to run this example on a multi-node cluster: pbt_example — Ray v1.10.0
It works fine on one machine but consistently fails when I use multiple nodes. I don't have rsync installed and am using AWS S3 for the upload_dir.
Ray version: 1.10

sync_config = tune.SyncConfig(upload_dir="s3://mybucket/raytune/pbt/pbt_test/")

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=20,
    hyperparam_mutations={
        # distribution for resampling
        "lr": lambda: random.uniform(0.0001, 0.02),
        # allow perturbations within this set of categorical values
        "some_other_factor": [1, 2],
    })

analysis = tune.run(
    PBTBenchmarkExample,
    name="pbt_test",
    scheduler=pbt,
    sync_config=sync_config,
    local_dir="/opt/ml/model/checkpoints/",
    metric="mean_accuracy",
    mode="max",
    fail_fast=True,
    reuse_actors=True,
    checkpoint_freq=20,
    checkpoint_score_attr="mean_accuracy",
    stop={
        "training_iteration": 200,
    },
    num_samples=8,
    config=hpo_cfg,
)
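
Since rsync is not installed, everything here depends on the S3 upload_dir syncing correctly from every node. As far as I know, the default cloud syncer in this Ray version shells out to the AWS CLI, so one quick sanity check is that the aws binary is on PATH on every node. A minimal sketch (check_aws_cli is just an illustrative helper; it assumes the cluster is already up):

import shutil
import socket

import ray

ray.init(address="auto")


@ray.remote(num_cpus=0)
def check_aws_cli():
    # Report this node's hostname and the path to the `aws` binary (None if missing).
    return socket.gethostname(), shutil.which("aws")


# Pin one task to each alive node via the built-in "node:<ip>" resource.
results = ray.get([
    check_aws_cli.options(
        resources={f"node:{node['NodeManagerAddress']}": 0.01}
    ).remote()
    for node in ray.nodes()
    if node["Alive"]
])
print(results)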

(PBTBenchmarkExample pid=115, ip=100.x.x.111) 2022-03-01 22:57:32,416	INFO trainable.py:473 -- Restored on 100.71.29.111 from checkpoint: /opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00001_1_2022-03-01_22-57-21/checkpoint_000040/checkpoint
(PBTBenchmarkExample pid=115, ip=100.x.x.111) 2022-03-01 22:57:32,417	INFO trainable.py:480 -- Current state after restoring: {'_iteration': 40, '_timesteps_total': None, '_time_total': 0.0021026134490966797, '_episodes_total': None}

2022-03-01 22:57:32,608	ERROR trial_runner.py:1128 -- Trial PBTBenchmarkExample_f2eea_00006: Error processing restore.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 1121, in _process_trial_restore
    self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 707, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/worker.py", line 1733, in get
    raise value.as_instanceof_cause()

ray.exceptions.RayTaskError(FileNotFoundError): ray::PBTBenchmarkExample.restore() (pid=536, ip=100.x.x.111, repr=<ray_pbt_tune2.PBTBenchmarkExample object at 0x7f5b1ca28690>)
  File "/usr/local/lib/python3.7/site-packages/ray/tune/trainable.py", line 453, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00006_6_2022-03-01_22-57-21/checkpoint_000040/checkpoint.tune_metadata'

Can anybody have a look at this issue, please?

Taking a look. Also @kai

@xwjiang2010 did you find anything?

Ah sorry, got distracted by something else. I will take a look today!

Hi, I took a look but couldn't repro. Here is my script. I updated it to use SyncConfig(upload_dir="s3://").

#!/usr/bin/env python

import numpy as np
import argparse
import random

import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining


class PBTBenchmarkExample(tune.Trainable):
    """Toy PBT problem for benchmarking adaptive learning rate.

    The goal is to optimize this trainable's accuracy. The accuracy increases
    fastest at the optimal lr, which is a function of the current accuracy.

    The optimal lr schedule for this problem is the triangle wave as follows.
    Note that many lr schedules for real models also follow this shape:

     best lr
      ^
      |    /\
      |   /  \
      |  /    \
      | /      \
      ------------> accuracy

    In this problem, using PBT with a population of 2-4 is sufficient to
    roughly approximate this lr schedule. Higher population sizes will yield
    faster convergence. Training will not converge without PBT.
    """

    def setup(self, config):
        self.lr = config["lr"]
        self.accuracy = 0.0  # end = 1000

    def step(self):
        midpoint = 100  # lr starts decreasing after acc > midpoint
        q_tolerance = 3  # penalize exceeding lr by more than this multiple
        noise_level = 2  # add gaussian noise to the acc increase
        # triangle wave:
        #  - start at 0.001 @ t=0,
        #  - peak at 0.01 @ t=midpoint,
        #  - end at 0.001 @ t=midpoint * 2,
        if self.accuracy < midpoint:
            optimal_lr = 0.01 * self.accuracy / midpoint
        else:
            optimal_lr = 0.01 - 0.01 * (self.accuracy - midpoint) / midpoint
        optimal_lr = min(0.01, max(0.001, optimal_lr))

        # compute accuracy increase
        q_err = max(self.lr, optimal_lr) / min(self.lr, optimal_lr)
        if q_err < q_tolerance:
            self.accuracy += (1.0 / q_err) * random.random()
        elif self.lr > optimal_lr:
            self.accuracy -= (q_err - q_tolerance) * random.random()
        self.accuracy += noise_level * np.random.normal()
        self.accuracy = max(0, self.accuracy)

        return {
            "mean_accuracy": self.accuracy,
            "cur_lr": self.lr,
            "optimal_lr": optimal_lr,  # for debugging
            "q_err": q_err,  # for debugging
            "done": self.accuracy > midpoint * 2,
        }

    def save_checkpoint(self, checkpoint_dir):
        return {
            "accuracy": self.accuracy,
            "lr": self.lr,
        }

    def load_checkpoint(self, checkpoint):
        self.accuracy = checkpoint["accuracy"]

    def reset_config(self, new_config):
        self.lr = new_config["lr"]
        self.config = new_config
        return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    parser.add_argument(
        "--cluster",
        action="store_true",
        help="Distribute tuning on a cluster")
    parser.add_argument(
        "--server-address",
        type=str,
        default=None,
        required=False,
        help="The address of server to connect to if using "
        "Ray Client.")
    args, _ = parser.parse_known_args()

    if args.server_address:
        ray.init(f"ray://{args.server_address}")
    elif args.cluster:
        ray.init(address="auto")
    elif args.smoke_test:
        ray.init(num_cpus=2)  # force pausing to happen for test
    else:
        ray.init()

    pbt = PopulationBasedTraining(
        time_attr="training_iteration",
        perturbation_interval=20,
        hyperparam_mutations={
            # distribution for resampling
            "lr": lambda: random.uniform(0.0001, 0.02),
            # allow perturbations within this set of categorical values
            "some_other_factor": [1, 2],
        })

    analysis = tune.run(
        PBTBenchmarkExample,
        name="pbt_test",
        scheduler=pbt,
        metric="mean_accuracy",
        mode="max",
        reuse_actors=True,
        checkpoint_freq=20,
        verbose=False,
        stop={
            "training_iteration": 200,
        },
        num_samples=40,
        config={
            "lr": 0.0001,
            # note: this parameter is perturbed but has no effect on
            # the model training in this example
            "some_other_factor": 1,
        },
        sync_config=tune.SyncConfig("s3://data-test-ilr/durable_upload"))

    print("Best hyperparameters found were: ", analysis.best_config)

Some sample output from my run:

(PBTBenchmarkExample pid=4094, ip=172.31.95.222) 2022-03-14 22:27:16,900	INFO trainable.py:535 -- Restored on 172.31.95.222 from checkpoint: /home/ray/ray_results/pbt_test/PBTBenchmarkExample_203f8_00000_0_2022-03-14_22-24-04/tmpmplvxj3erestore_from_object/checkpoint
(PBTBenchmarkExample pid=4094, ip=172.31.95.222) 2022-03-14 22:27:16,900	INFO trainable.py:543 -- Current state after restoring: {'_iteration': 140, '_timesteps_total': None, '_time_total': 0.006560087203979492, '_episodes_total': None}

@taqreez
In your run, does every restore fail to find the checkpoint, or only some of them? I am trying to see if it's consistent or random.
Also @kai, one interesting thing: I think we restore from a temporary checkpoint in this case, but in @taqreez's case there is no tmp_xyz in the checkpoint path in the error message. Does this ring a bell to you?
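
If it helps narrow this down, a quick check (just a sketch, using the local_dir and trial name from the error message) would be to list what actually exists under the failing trial's directory on that node. That would tell us whether the whole checkpoint_000040 directory is missing there or only the .tune_metadata file:

import glob

# Run this on the node that raised the FileNotFoundError (ip=100.x.x.111).
pattern = "/opt/ml/model/checkpoints/pbt_test29/PBTBenchmarkExample_f2eea_00006*/checkpoint_*/*"
for path in sorted(glob.glob(pattern)):
    print(path)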

Hi @taqreez, can you share more logs with us? E.g. the exploit messages would be helpful; a full run log would be best. Also, what does the directory structure look like inside your bucket for that experiment?
Can you share your cluster setup: how many nodes, and which resources are you using? A cluster config would be best so we can reproduce easily. If you provide us with something we can “just run” (aside from the S3 bucket etc.), it makes it much easier for us to help you.
Lastly, can you try running again with Ray 1.11.0?
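
For the bucket layout, something like this would give us the object keys for the experiment (a sketch assuming boto3 is installed and AWS credentials are configured; bucket and prefix taken from the SyncConfig in the original post):

import boto3

# List everything that was uploaded under the experiment prefix on S3.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket", Prefix="raytune/pbt/pbt_test/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])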