Unable to restore Ray Tune previous experiment checkpoint

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Here are snippets of my code:
Original tuner experiment =

from ray import tune
from ray.tune.search.basic_variant import BasicVariantGenerator

algo = BasicVariantGenerator(max_concurrent=2, random_state=0)

search_space = {"n_gram_size": tune.grid_search([2, 3]),
                "vector_size": tune.grid_search([300, 1200, 2400]),
                "fasttext_window": tune.grid_search([4, 8]),
                "train_epoch": tune.grid_search([5, 10]),
                "char_ngrams_length": tune.grid_search(list(conditional_max_n_and_min_n()))}

import numpy as np
from ray import air

# For non-grid-search tuning, specify num_samples=10 in TuneConfig

tuner = tune.Tuner(
    fasttext_tuning_func,
    tune_config = tune.TuneConfig(metric="weighted macro f1-score",
                                  mode="max",
                                  search_alg=algo,
                                  trial_dirname_creator=lambda trial: trial.trainable_name + "_" + trial.trial_id),
    run_config=air.RunConfig(name="fasttext tuning",
                             local_dir=ipynb_path+'/fasttext tuning files',
                             verbose=1),
    param_space = search_space)

Restoring tuner experiment =

from ray import tune

results_experiment_path = ipynb_path+'/fasttext tuning files/fasttext tuning'

search_space = {"n_gram_size": tune.grid_search([2, 3]),
                "vector_size": tune.grid_search([300, 1200, 2400]),
                "fasttext_window": tune.grid_search([4, 8]),
                "train_epoch": tune.grid_search([5, 10]),
                "char_ngrams_length": tune.grid_search(list(conditional_max_n_and_min_n()))}

tuner = tune.Tuner.restore(results_experiment_path,
                           trainable=fasttext_tuning_func,
                           param_space=search_space,
                           resume_unfinished=True,
                           resume_errored=True)

I was hyperparameter tuning a word embedding model for my use case and I would like to continue running the experiment. However, when I try restoring the previous checkpoint, the following warnings are issued =

Basically, the tuner can’t find the previous experiment checkpoint and has created a new experiment. How can I fix this so the tuner recognizes the previous experiment checkpoint and continues the previous experiment (rather than creating a new one)?

Can you double check results_experiment_path actually exists?
It should be the top level of the experiment folder.
For example, you should see things like tuner.pkl and checkpoint_00000/ in there.
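
Something like this quick check (just a sketch, assuming ipynb_path is the same variable from your notebook) can show what Tuner.restore() is actually being pointed at:

import os

# The folder that Tuner.restore() will be pointed at
results_experiment_path = ipynb_path + '/fasttext tuning files/fasttext tuning'

print("Folder exists:", os.path.isdir(results_experiment_path))
print("Has tuner.pkl:", os.path.isfile(os.path.join(results_experiment_path, "tuner.pkl")))

# Trial sub-folders (and any checkpoint_00000/ directories inside them) should show up here
for entry in sorted(os.listdir(results_experiment_path)):
    print(entry)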

Yes, it is the right path; as you can see in the following screenshot, the tuner.pkl file is there:

However, there is no checkpoint_00000/ in the folder… Could those be the fasttext_tuning_func_ sub-folders? I do use trial_dirname_creator to make the sub-folder names easier to read…

Ah ok, I think that makes a lot of sense.
Since you specified an experiment name that doesn’t change across runs, all your runs are adding trials into the same experiment folder, and that may have made things complicated.
We recently filed a bug about this: [AIR] `trainer.pkl` and `tuner.pkl` files needed for restoration get replaced by new runs · Issue #35812 · ray-project/ray · GitHub

Can I suggest you try adding a timestamp to your experiment name, like

f"fasttext_tuning_{datetime.datetime.now().strftime("%Y%m%d%H%M%S")}"

That way, you will start a new experiment folder every time, and can resume from the last separate experiment folder.
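
For example, plugged into the RunConfig from your original snippet (just a sketch; ipynb_path and the rest come from your code):

import datetime
from ray import air

# A unique experiment name per run, so every run writes into its own folder
experiment_name = f"fasttext_tuning_{datetime.datetime.now().strftime('%Y%m%d%H%M%S')}"

run_config = air.RunConfig(name=experiment_name,
                           local_dir=ipynb_path + '/fasttext tuning files',
                           verbose=1)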

Looking at your error message a bit more, there may also be a bug with our path handling.
For example, the job is trying to download files from /Users/Torivo/Documents/... instead of C:/Users/Torivo/Documents/...

Can you print out ipynb_path+'/fasttext tuning files/fasttext tuning' to make sure it’s a path that starts with C:/...?
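
Something along these lines would do (assuming ipynb_path is already defined):

from pathlib import Path

restore_path = ipynb_path + '/fasttext tuning files/fasttext tuning'
print(restore_path)                   # should start with C:/...
print(Path(restore_path).resolve())   # absolute form, including the drive letter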

Thank you for your suggestions. I will try setting the RunConfig name parameter to f"fasttext_tuning_{datetime.datetime.now().strftime('%Y%m%d%H%M%S')}". I will let you know how the tuning goes tomorrow.

Anyway, the path starts with C:/... as you can see in the following screenshot =

Thanks. Then we probably have an issue on our end as well.
Please let us know how the test goes.

Sorry, even after correcting the paths with the following code =

from ray import tune
from ray.tune.search.basic_variant import BasicVariantGenerator

algo = BasicVariantGenerator(max_concurrent=2, random_state=0)

search_space = {"n_gram_size": tune.grid_search([2, 3]),
                "vector_size": tune.grid_search([300, 1200, 2400]),
                "fasttext_window": tune.grid_search([4, 8]),
                "train_epoch": tune.grid_search([5, 10]),
                "char_ngrams_length": tune.grid_search(list(conditional_max_n_and_min_n()))}

import os
import re

# Pick the first sub-folder whose name starts with "fasttext_tuning_"
tuning_files_dir = ipynb_path + '/fasttext tuning files/'
experiment_folders = [i for i in os.listdir(tuning_files_dir)
                      if re.search("^fasttext_tuning_.*", i) is not None]
results_experiment_path = tuning_files_dir + experiment_folders[0]


restored_tuner = tune.Tuner.restore(results_experiment_path,
                                    trainable=fasttext_tuning_func,
                                    param_space=search_space,
                                    resume_unfinished=True,
                                    resume_errored=True)

I still stumbled on the “No remote checkpoint was found or an error occurred when trying to download the experiment checkpoint. Please check the previous warning message for more details. Ray Tune will now start a new experiment” warning message, as seen in the following screenshot =

Hi @Andreas_Parasian, it looks like you’re running into an issue where Ray Tune thinks the results_experiment_path is a remote path (like on cloud storage), and then it tries to download to the same location (causing this “being used by another process” error). The Windows path does not seem to be handled properly. I will create a GitHub issue and link it here.

In the meantime, could you paste the full results_experiment_path?
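
If you want to experiment while we look into it, one workaround worth trying (just a sketch, not a confirmed fix) is to resolve the path to its absolute form before handing it to Tuner.restore():

from pathlib import Path

# Resolve to an absolute path, including the drive letter; whether this
# sidesteps the Windows path handling issue is not confirmed.
resolved_path = str(Path(results_experiment_path).resolve())
print(resolved_path)

restored_tuner = tune.Tuner.restore(resolved_path,
                                    trainable=fasttext_tuning_func,
                                    param_space=search_space,
                                    resume_unfinished=True,
                                    resume_errored=True)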