I am using Ray Tune on Google Colab to run a training job with 4 trials. Two trials completed successfully, but the other two have been stuck in a pending state for over 6 hours. I saved the Ray Tune results folder (/root/ray_results/train_cifar_2024-09-06_18-07-44) to my Google Drive and now want to stop the run and resume it later.
To resume the trials, I copied the saved folder back to /root/ray_results/train_cifar_2024-09-06_18-07-44 on Colab and tried to resume the run using the following configuration:
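For reference, the copy step between Google Drive and the Colab filesystem can be sketched with the standard library alone. This is a minimal sketch, not the exact commands that were run: the helper name copy_experiment and the Drive mount path in the usage comment are assumptions for illustration.

```python
import shutil
from pathlib import Path


def copy_experiment(src: str, dst: str) -> Path:
    """Copy a whole experiment folder (trial dirs, state files, checkpoints)."""
    src_path = Path(src)
    dst_path = Path(dst)
    if not src_path.is_dir():
        raise FileNotFoundError(f"experiment folder not found: {src_path}")
    # dirs_exist_ok=True merges into an existing destination folder,
    # overwriting files that have the same name (requires Python 3.8+).
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return dst_path


# Hypothetical usage; the Drive path is an assumption:
# copy_experiment(
#     "/content/drive/MyDrive/ray_results/train_cifar_2024-09-06_18-07-44",
#     "/root/ray_results/train_cifar_2024-09-06_18-07-44",
# )
```

Copying the whole folder rather than individual trial directories keeps the experiment_state and basic-variant-state JSON files next to the trial folders.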
resume_config = ResumeConfig(
    unfinished=ResumeConfig.ResumeType.RESUME,  # Resume unfinished trials
    errored=ResumeConfig.ResumeType.RESUME,  # Also resume errored trials, if any
)
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter,
    restore='/root/ray_results/train_cifar_2024-09-06_18-07-44/',
    resume_config=resume_config,
)
However, I encountered the following error:
ray.exceptions.RayTaskError(AttributeError): 'Checkpoint' object has no attribute 'to_dict'
The error seems to occur during the restore process, and none of the trials proceed beyond 0 iterations.
Here is the folder structure of /root/ray_results/train_cifar_2024-09-06_18-07-44:
└── ray_results
    └── train_cifar_2024-09-06_18-07-44
        ├── basic-variant-state-2024-09-06_18-07-44.json
        ├── experiment_state-2024-09-06_18-07-44.json
        ├── train_cifar_ea445_00000_0_batch_size=4,epochs=100,kfolds=5,lr=0.0010_2024-09-06_18-07-44
        │   ├── checkpoint_000000
        │   │   └── data.pkl
        │   ├── events.out.tfevents.1725646069.2b037ca6e5cc
        │   ├── params.json
        │   ├── params.pkl
        │   └── result.json
        ├── train_cifar_ea445_00001_1_batch_size=8,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
        │   ├── checkpoint_000000
        │   │   └── data.pkl
        │   ├── events.out.tfevents.1725646069.2b037ca6e5cc
        │   ├── params.json
        │   ├── params.pkl
        │   └── result.json
        ├── train_cifar_ea445_00002_2_batch_size=8,epochs=70,kfolds=5,lr=0.0010_2024-09-06_18-07-44
        └── train_cifar_ea445_00003_3_batch_size=4,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
I would like to resume the last two trials, but as you can see, their folders are empty and no iterations were completed. How can I properly resume these pending trials so that they continue from where they left off instead of restarting from scratch? Is there a specific issue with how I am using the restore argument or ResumeConfig, or do I need to handle the unfinished trials manually in a different way?
The trials I would like to continue are:
train_cifar_ea445_00002_2_batch_size=8,epochs=70,kfolds=5,lr=0.0010_2024-09-06_18-07-44
train_cifar_ea445_00003_3_batch_size=4,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
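To double-check which trial folders actually contain checkpoints after copying the results back, a small standard-library sketch like the one below can be used. The helper name checkpoints_by_trial is hypothetical, and the sketch only assumes the checkpoint_* directory naming visible in the folder structure above; it does not call any Ray API.

```python
from pathlib import Path


def checkpoints_by_trial(experiment_dir: str) -> dict:
    """Map each trial folder name to the sorted names of its checkpoint_* dirs."""
    result = {}
    for trial_dir in sorted(Path(experiment_dir).iterdir()):
        if not trial_dir.is_dir():
            continue  # skip experiment-level files such as experiment_state-*.json
        result[trial_dir.name] = sorted(
            p.name for p in trial_dir.glob("checkpoint_*") if p.is_dir()
        )
    return result


# Hypothetical usage:
# for trial, ckpts in checkpoints_by_trial(
#     "/root/ray_results/train_cifar_2024-09-06_18-07-44"
# ).items():
#     print(trial, ckpts if ckpts else "no checkpoints")
```

A trial that maps to an empty list has nothing on disk to restore from, which matches what the two pending trial folders look like here.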