Hi,
I have subclassed the Ray Tune (1.7) example python/ray/tune/examples/pbt_tune_cifar10_with_keras.py to use DurableTrainable and I am seeing the error below. It is consistently reproducible, and I would appreciate any insight into what I am doing wrong. I am passing the sync config to the tune.run parameters (a simplified sketch of the subclass is at the end of this post). Checkpoints do get uploaded to S3, but partway through the run I see this error:
```
2021-11-01 02:12:17,667 ERROR worker.py:80 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::Cifar10Model.restore_from_object() (pid=16160, ip=172.31.63.11, repr=<pbt_tune_cifar10_with_keras.Cifar10Model object at 0x7f5f9d49ee10>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::Cifar10Model.save_to_object() (pid=15013, ip=172.31.63.11, repr=<pbt_tune_cifar10_with_keras.Cifar10Model object at 0x7f80fd95d650>)
  File "/home/ec2-user/anaconda3/envs/ray17/lib/python3.7/site-packages/ray/tune/trainable.py", line 360, in save_to_object
    checkpoint_path = self.save(tmpdir)
  File "/home/ec2-user/anaconda3/envs/ray17/lib/python3.7/site-packages/ray/tune/durable_trainable.py", line 73, in save
    raise ValueError("`checkpoint_dir` must be `self.logdir`, or "
ValueError: `checkpoint_dir` must be `self.logdir`, or a sub-directory.
```
These are the parameters I am passing:
```python
sync_config = tune.SyncConfig(
    sync_to_driver=False,
    upload_dir="s3://scratch-fs/raydbg/")

analysis = tune.run(
    Cifar10Model,
    name="pbt_cifar10",
    scheduler=pbt,
    resources_per_trial={
        "cpu": 1,
        "gpu": 0
    },
    stop={
        "mean_accuracy": 0.80,
        "training_iteration": 30,
    },
    config=space,
    num_samples=4,
    metric="mean_accuracy",
    mode="max",
    sync_config=sync_config,
    sync_to_driver=False,
    verbose=3
)
```
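
For context, here is roughly how the trainable is set up in my script. This is a simplified sketch, not my full code: everything elided with `...` (model building, the training loop, the accuracy bookkeeping) is unchanged from the original example, and the only real change is the base class.

```python
# Simplified sketch of the subclass. Assumption: everything elided with "..."
# is unchanged from python/ray/tune/examples/pbt_tune_cifar10_with_keras.py;
# the only real change is Trainable -> DurableTrainable.
import os

from ray.tune.durable_trainable import DurableTrainable


class Cifar10Model(DurableTrainable):
    def setup(self, config):
        ...  # build the Keras model and assign it to self.model, as in the original example

    def step(self):
        ...  # train one epoch, as in the original example
        return {"mean_accuracy": 0.0}  # placeholder; the real code returns the epoch accuracy

    def save_checkpoint(self, checkpoint_dir):
        # Write the checkpoint inside the directory Tune passes in.
        file_path = os.path.join(checkpoint_dir, "model")
        self.model.save(file_path)
        return file_path

    def load_checkpoint(self, path):
        ...  # restore the Keras weights saved by save_checkpoint()
```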
Thank you!