Hi,
I have subclassed the Ray Tune (1.7) example python/ray/tune/examples/pbt_tune_cifar10_with_keras.py to use DurableTrainable and I am seeing the error below. It is consistently reproducible, and I would appreciate any insight into what I am doing wrong. I am passing the sync config to the tune.run parameters (a simplified sketch of the subclass is at the end of this post). Checkpoints do get uploaded to S3, but partway through the run I see this error:
```
2021-11-01 02:12:17,667 ERROR worker.py:80 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::Cifar10Model.restore_from_object() (pid=16160, ip=172.31.63.11, repr=<pbt_tune_cifar10_with_keras.Cifar10Model object at 0x7f5f9d49ee10>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::Cifar10Model.save_to_object() (pid=15013, ip=172.31.63.11, repr=<pbt_tune_cifar10_with_keras.Cifar10Model object at 0x7f80fd95d650>)
  File "/home/ec2-user/anaconda3/envs/ray17/lib/python3.7/site-packages/ray/tune/trainable.py", line 360, in save_to_object
    checkpoint_path = self.save(tmpdir)
  File "/home/ec2-user/anaconda3/envs/ray17/lib/python3.7/site-packages/ray/tune/durable_trainable.py", line 73, in save
    raise ValueError("`checkpoint_dir` must be `self.logdir`, or "
ValueError: `checkpoint_dir` must be `self.logdir`, or a sub-directory.
```
These are the parameters I am passing:
```python
sync_config = tune.SyncConfig(
    sync_to_driver=False,
    upload_dir="s3://scratch-fs/raydbg/")

analysis = tune.run(
    Cifar10Model,
    name="pbt_cifar10",
    scheduler=pbt,
    resources_per_trial={
        "cpu": 1,
        "gpu": 0
    },
    stop={
        "mean_accuracy": 0.80,
        "training_iteration": 30,
    },
    config=space,
    num_samples=4,
    metric="mean_accuracy",
    mode="max",
    sync_config=sync_config,
    sync_to_driver=False,
    verbose=3
)
```
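
For context, here is roughly how the trainable is set up in my script. This is a simplified sketch, not my full code: everything elided with `...` (model building, the training loop, the accuracy bookkeeping) is unchanged from the original example, and the only real change is the base class.

```python
# Simplified sketch of the subclass. Assumption: everything elided with "..."
# is unchanged from python/ray/tune/examples/pbt_tune_cifar10_with_keras.py;
# the only real change is Trainable -> DurableTrainable.
import os

from ray.tune.durable_trainable import DurableTrainable


class Cifar10Model(DurableTrainable):
    def setup(self, config):
        ...  # build the Keras model and assign it to self.model, as in the original example

    def step(self):
        ...  # train one epoch, as in the original example
        return {"mean_accuracy": 0.0}  # placeholder; the real code returns the epoch accuracy

    def save_checkpoint(self, checkpoint_dir):
        # Write the checkpoint inside the directory Tune passes in.
        file_path = os.path.join(checkpoint_dir, "model")
        self.model.save(file_path)
        return file_path

    def load_checkpoint(self, path):
        ...  # restore the Keras weights saved by save_checkpoint()
```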
Thank you!