Ray Tune Trials Failing to Resume After Saving and Restoring on Google Colab: AttributeError 'Checkpoint' Object Has No Attribute 'to_dict'

I am using Ray Tune on Google Colab to run a training job with 4 trials. Two trials completed successfully, but the other two have been stuck in a pending state for over 6 hours. I saved the Ray Tune results folder (/root/ray_results/train_cifar_2024-09-06_18-07-44) to my Google Drive and now want to stop the run and resume it later.

To resume the trials, I copied the saved folder back to /root/ray_results/train_cifar_2024-09-06_18-07-44 on Colab and tried to resume the run using the following configuration:

from functools import partial

from ray import tune
from ray.tune import ResumeConfig  # import path may differ between Ray versions

resume_config = ResumeConfig(
    unfinished=ResumeConfig.ResumeType.RESUME,  # resume unfinished (pending) trials
    errored=ResumeConfig.ResumeType.RESUME      # also resume errored trials, if any
)

result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter,
    restore='/root/ray_results/train_cifar_2024-09-06_18-07-44/',
    resume_config=resume_config
)

However, I encountered the following error:

ray.exceptions.RayTaskError(AttributeError): 'Checkpoint' object has no attribute 'to_dict'

The error seems to occur during the restore process, and none of the trials proceed beyond 0 iterations.

Here is the folder structure of /root/ray_results/train_cifar_2024-09-06_18-07-44:


└── ray_results
    └── train_cifar_2024-09-06_18-07-44
        ├── basic-variant-state-2024-09-06_18-07-44.json
        ├── experiment_state-2024-09-06_18-07-44.json
        ├── train_cifar_ea445_00000_0_batch_size=4,epochs=100,kfolds=5,lr=0.0010_2024-09-06_18-07-44
        │   ├── checkpoint_000000
        │   │   └── data.pkl
        │   ├── events.out.tfevents.1725646069.2b037ca6e5cc
        │   ├── params.json
        │   ├── params.pkl
        │   └── result.json
        ├── train_cifar_ea445_00001_1_batch_size=8,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
        │   ├── checkpoint_000000
        │   │   └── data.pkl
        │   ├── events.out.tfevents.1725646069.2b037ca6e5cc
        │   ├── params.json
        │   ├── params.pkl
        │   └── result.json
        ├── train_cifar_ea445_00002_2_batch_size=8,epochs=70,kfolds=5,lr=0.0010_2024-09-06_18-07-44
        └── train_cifar_ea445_00003_3_batch_size=4,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
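For reference, each trial's hyperparameters are encoded in its folder name. Here is a small helper I use to pull them out (plain Python, no Ray needed; the name format is exactly what appears in the tree above):

```python
import re

def parse_trial_name(name):
    """Pull the 'key=value' hyperparameter pairs out of a Ray Tune trial
    directory name such as '..._batch_size=4,epochs=100,kfolds=5,lr=0.0010_...'."""
    params = {}
    for key, value in re.findall(r"([a-z_]+)=([\d.]+)", name):
        key = key.lstrip("_")  # drop the separator that precedes the first key
        # Keep whole numbers as int, decimals as float.
        params[key] = float(value) if "." in value else int(value)
    return params
```

So trial `ea445_00002` maps to `{'batch_size': 8, 'epochs': 70, 'kfolds': 5, 'lr': 0.001}`.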

I would like to resume the last two trials, but as you can see, their folders appear empty with no iterations completed. How can I properly resume these pending trials, ensuring that they continue from where they left off and do not restart from scratch? Is there a specific issue with how I am using the restore function or ResumeConfig, or do I need to manually handle the unfinished trials in a different way?

The trials I would like to continue are:

train_cifar_ea445_00002_2_batch_size=8,epochs=70,kfolds=5,lr=0.0010_2024-09-06_18-07-44

train_cifar_ea445_00003_3_batch_size=4,epochs=40,kfolds=5,lr=0.0001_2024-09-06_18-07-44
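To double-check the state before resuming, I scan which trial folders actually contain a checkpoint (plain Python, no Ray required; directory layout as shown above). For the two trials I want to continue, this returns nothing, which matches the empty folders in the tree:

```python
import os

def trials_with_checkpoints(experiment_dir):
    """Return the trial directory names under an experiment folder that
    contain at least one 'checkpoint_*' subdirectory."""
    resumable = []
    for name in sorted(os.listdir(experiment_dir)):
        trial_dir = os.path.join(experiment_dir, name)
        if not os.path.isdir(trial_dir):
            continue  # skip experiment_state-*.json and similar files
        if any(entry.startswith("checkpoint_") for entry in os.listdir(trial_dir)):
            resumable.append(name)
    return resumable
```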