Ray Tune Trials Failing to Resume After Saving and Restoring on Google Colab: AttributeError 'Checkpoint' Object Has No Attribute 'to_dict'

Rabee_Qasem · September 7, 2024, 12:02pm

I am using Ray Tune on Google Colab to run a training job with 4 trials. Two trials completed successfully, but the other two have been stuck in a pending state for over 6 hours. I saved the Ray Tune results folder (/root/ray_results/train_cifar_2024-09-06_18-07-44) to my Google Drive and now want to stop the run and resume it later.

To resume the trials, I copied the saved folder back to /root/ray_results/train_cifar_2024-09-06_18-07-44 on Colab and tried to resume the run using the following configuration:

resume_config = ResumeConfig(
    unfinished=ResumeConfig.ResumeType.RESUME,  # Resume unfinished trials
    errored=ResumeConfig.ResumeType.RESUME      # Also resume errored trials if any
)

result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter,
    restore='/root/ray_results/train_cifar_2024-09-06_18-07-44/',
    resume_config=resume_config
)

However, I encountered the following error:

ray.exceptions.RayTaskError(AttributeError): 'Checkpoint' object has no attribute 'to_dict'

The error seems to occur during the restore process, and none of the trials proceed beyond 0 iterations.

Here is the folder structure of /root/ray_results/train_cifar_2024-09-06_18-07-44:


└── ray_results
    └── train_cifar_2024-09-06_18-07-44
        ├── basic-variant-state-2024-09-06_18-07-44.json
        ├── experiment_state-2024-09-06_18-07-44.json
        ├── train_cifar_ea445_00000_0_batch_size=4,epochs=100,kfolds=5,lr=0.0010_2024-09-06_18-07-44
        │   ├── checkpoint_000000
        │   │   └── data.pkl
        │   ├── events.out.tfevents.1725646069.2b037ca6e5cc
        │   ├── params.json
        │   ├── params.pkl
        │   └── result.json
        ├── train_cifar_ea445_00001_1_batch_size=8,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
        │   ├── checkpoint_000000
        │   │   └── data.pkl
        │   ├── events.out.tfevents.1725646069.2b037ca6e5cc
        │   ├── params.json
        │   ├── params.pkl
        │   └── result.json
        ├── train_cifar_ea445_00002_2_batch_size=8,epochs=70,kfolds=5,lr=0.0010_2024-09-06_18-07-44
        └── train_cifar_ea445_00003_3_batch_size=4,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44

I would like to resume the last two trials, but as you can see, their folders appear empty with no iterations completed. How can I properly resume these pending trials, ensuring that they continue from where they left off and do not restart from scratch? Is there a specific issue with how I am using the restore function or ResumeConfig, or do I need to manually handle the unfinished trials in a different way?

The trials I would like to continue are
train_cifar_ea445_00002_2_batch_size=8,epochs=70,kfolds=5,lr=0.0010_2024-09-06_18-07-44
and
train_cifar_ea445_00003_3_batch_size=4,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
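
To confirm which trial folders actually contain a checkpoint after copying the results back from Google Drive, a small standard-library script along these lines could be used. It is only a sketch: the function name `summarize_trials` is hypothetical, and it assumes the layout shown in the folder listing, where every subdirectory of the experiment folder is a trial folder and checkpoints live in `checkpoint_*` subfolders. It does not use any Ray API.

```python
from pathlib import Path


def summarize_trials(experiment_dir):
    """Map each trial folder name to the sorted names of its checkpoint_* subfolders.

    Hypothetical helper: assumes every subdirectory of experiment_dir is a trial folder.
    """
    summary = {}
    for trial_dir in sorted(Path(experiment_dir).iterdir()):
        if not trial_dir.is_dir():
            continue  # skip experiment_state-*.json and other plain files
        checkpoints = sorted(
            p.name for p in trial_dir.glob("checkpoint_*") if p.is_dir()
        )
        summary[trial_dir.name] = checkpoints
    return summary


if __name__ == "__main__":
    # Path taken from the post; adjust if the folder was copied elsewhere.
    experiment_dir = "/root/ray_results/train_cifar_2024-09-06_18-07-44"
    if Path(experiment_dir).is_dir():
        for name, checkpoints in summarize_trials(experiment_dir).items():
            print(name, "->", checkpoints or "no checkpoints")
```

With the structure shown above, the first two trials would be expected to list `checkpoint_000000` and the last two to report no checkpoints, which would mean there is no saved state for those two trials to continue from.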
