Unable to restore Ray Tune previous experiment checkpoint

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Here are snippets of my code:
Original tuner experiment =

from ray import tune
from ray.tune.search.basic_variant import BasicVariantGenerator

algo = BasicVariantGenerator(max_concurrent=2, random_state=0)

search_space = {"n_gram_size": tune.grid_search([2, 3]),
                "vector_size": tune.grid_search([300, 1200, 2400]),
                "fasttext_window": tune.grid_search([4, 8]),
                "train_epoch": tune.grid_search([5, 10]),
                "char_ngrams_length": tune.grid_search(list(conditional_max_n_and_min_n()))}

import numpy as np
from ray import air

# For non-grid-search tuning, specify num_samples=10 in TuneConfig

tuner = tune.Tuner(
    fasttext_tuning_func,
    tune_config = tune.TuneConfig(metric="weighted macro f1-score",
                                  mode="max",
                                  search_alg=algo,
                                  trial_dirname_creator=lambda trial: trial.trainable_name + "_" + trial.trial_id),
    run_config=air.RunConfig(name="fasttext tuning",
                             local_dir=ipynb_path+'/fasttext tuning files',
                             verbose=1),
    param_space = search_space)

Restoring tuner experiment =

from ray import tune

results_experiment_path = ipynb_path+'/fasttext tuning files/fasttext tuning'

search_space = {"n_gram_size": tune.grid_search([2, 3]),
                "vector_size": tune.grid_search([300, 1200, 2400]),
                "fasttext_window": tune.grid_search([4, 8]),
                "train_epoch": tune.grid_search([5, 10]),
                "char_ngrams_length": tune.grid_search(list(conditional_max_n_and_min_n()))}

tuner = tune.Tuner.restore(results_experiment_path,
                           trainable=fasttext_tuning_func,
                           param_space=search_space,
                           resume_unfinished=True,
                           resume_errored=True)

I was hyperparameter tuning a word embedding model for my use case and I would like to continue running the experiment. However, when I try restoring the previous checkpoint, the following warnings are issued =

Basically, the tuner can’t find the previous experiment checkpoint and has created a new experiment. How can I fix this so the tuner recognizes the previous experiment checkpoint and continues the previous experiment (rather than creating a new one)?

Can you double check results_experiment_path actually exists?
It should be the top level of the experiment folder.
For example, you should see things like tuner.pkl and checkpoint_00000/ in there.
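
Something like this quick check (just a sketch, assuming ipynb_path is the same variable from your notebook) can show what Tuner.restore() is actually being pointed at:

import os

# The folder that Tuner.restore() will be pointed at
results_experiment_path = ipynb_path + '/fasttext tuning files/fasttext tuning'

print("Folder exists:", os.path.isdir(results_experiment_path))
print("Has tuner.pkl:", os.path.isfile(os.path.join(results_experiment_path, "tuner.pkl")))

# Trial sub-folders (and any checkpoint_00000/ directories inside them) should show up here
for entry in sorted(os.listdir(results_experiment_path)):
    print(entry)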

Yes, it is the right path; as you can see in the following screenshot, the tuner.pkl file is there:

However, there is no checkpoint_00000/ in the folder… Could those be the fasttext_tuning_func_ sub-folders? I do use trial_dirname_creator to make the sub-folder names easier to read…

Ah ok, I think that makes a lot of sense.
Since you specified an experiment name that doesn’t change across runs, all your runs are adding trials into the same experiment folder, and that may have made things complicated.
We recently filed a bug about this: [AIR] `trainer.pkl` and `tuner.pkl` files needed for restoration get replaced by new runs · Issue #35812 · ray-project/ray · GitHub

Can I suggest you try adding a timestamp to your experiment name, like

f"fasttext_tuning_{datetime.datetime.now().strftime("%Y%m%d%H%M%S")}"

That way, you will start a new experiment folder every time, and can resume from the last separate experiment folder.
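
For example, plugged into the RunConfig from your original snippet (just a sketch; ipynb_path and the rest come from your code):

import datetime
from ray import air

# A unique experiment name per run, so every run writes into its own folder
experiment_name = f"fasttext_tuning_{datetime.datetime.now().strftime('%Y%m%d%H%M%S')}"

run_config = air.RunConfig(name=experiment_name,
                           local_dir=ipynb_path + '/fasttext tuning files',
                           verbose=1)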

Looking at your error message a bit more, there may also be a bug with our path handling.
For example, the job is trying to download files from /Users/Torivo/Documents/... instead of C:/Users/Torivo/Documents/...

Can you print out ipynb_path+'/fasttext tuning files/fasttext tuning' to make sure it’s a path that starts with C:/...?
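
Something along these lines would do (assuming ipynb_path is already defined):

from pathlib import Path

restore_path = ipynb_path + '/fasttext tuning files/fasttext tuning'
print(restore_path)                   # should start with C:/...
print(Path(restore_path).resolve())   # absolute form, including the drive letter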

Thank you for your suggestions. I will try setting the RunConfig name parameter to f"fasttext_tuning_{datetime.datetime.now().strftime('%Y%m%d%H%M%S')}". I will let you know how the tuning goes tomorrow.

Anyway, the path starts with C:/... as you can see in the following screenshot =

Thanks. Then we probably have an issue on our end as well.
Please let us know how the test goes.

Sorry, even after correcting the paths with the following code =

from ray import tune
from ray.tune.search.basic_variant import BasicVariantGenerator

algo = BasicVariantGenerator(max_concurrent=2, random_state=0)

search_space = {"n_gram_size": tune.grid_search([2, 3]),
                "vector_size": tune.grid_search([300, 1200, 2400]),
                "fasttext_window": tune.grid_search([4, 8]),
                "train_epoch": tune.grid_search([5, 10]),
                "char_ngrams_length": tune.grid_search(list(conditional_max_n_and_min_n()))}

import os
import re

# Pick the first sub-folder whose name starts with "fasttext_tuning_"
tuning_files_dir = ipynb_path + '/fasttext tuning files/'
experiment_folders = [i for i in os.listdir(tuning_files_dir)
                      if re.search("^fasttext_tuning_.*", i) is not None]
results_experiment_path = tuning_files_dir + experiment_folders[0]


restored_tuner = tune.Tuner.restore(results_experiment_path,
                                    trainable=fasttext_tuning_func,
                                    param_space=search_space,
                                    resume_unfinished=True,
                                    resume_errored=True)

I still stumbled on the “No remote checkpoint was found or an error occurred when trying to download the experiment checkpoint. Please check the previous warning message for more details. Ray Tune will now start a new experiment” warning message, as seen in the following screenshot =

Hi @Andreas_Parasian, it looks like you’re running into an issue where Ray Tune thinks the results_experiment_path is a remote path (like on cloud storage), and then it tries to download to the same location (causing this “being used by another process” error). The Windows path does not seem to be handled properly. I will create a GitHub issue and link it here.

In the meantime, could you paste the full results_experiment_path?
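
If you want to experiment while we look into it, one workaround worth trying (just a sketch, not a confirmed fix) is to resolve the path to its absolute form before handing it to Tuner.restore():

from pathlib import Path

# Resolve to an absolute path, including the drive letter; whether this
# sidesteps the Windows path handling issue is not confirmed.
resolved_path = str(Path(results_experiment_path).resolve())
print(resolved_path)

restored_tuner = tune.Tuner.restore(resolved_path,
                                    trainable=fasttext_tuning_func,
                                    param_space=search_space,
                                    resume_unfinished=True,
                                    resume_errored=True)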