Why can't I resume the tune task normally when using basic variant generator?

EscapeReality8460894 · February 9, 2021, 4:18am

I ran into this problem yesterday, and I have a little free time today, so I’d like to discuss it with you.
Yesterday I wanted to save my tune task, so I tried the resume mechanism described in the manual. As the manual says, checkpoints are persisted at all times. But when I terminate the program and restart the tune task with resume set to true, it starts normally, yet the number of samples is not resumed. In resume mode, Ray just reruns the pending or running trials saved in the checkpoint. I found that this is because I used BasicVariantGenerator, which is the default search algorithm. It has no way to take part in the checkpoint mechanism, so during resume the runner’s resume method skips initializing some parameters of the search algorithm object (I mean BasicVariantGenerator). I guess the _iteration in the runner_data in the checkpoint file corresponds to the total number of samples. But it may not, because _iteration is not always equal to the number of samples specified in the tune.run method.
To sum up, I can’t terminate a tune task and resume it normally while using the basic variant generator. I can avoid this by using a search algorithm with a checkpoint-saving mechanism, such as the HyperOptSearch that I am using now, but is it reasonable that BasicVariantGenerator doesn’t support checkpoint saving?

rliaw · February 9, 2021, 4:27am

What version of Ray are you on? I think this should work on the next Ray release (coming this week).

EscapeReality8460894 · February 9, 2021, 4:32am

Thanks for the reply. I’m using v1.1.0, which as far as I can tell is the latest release of Ray.
In addition, I think the problem is in the code below:

        self.__setstate__(runner_state["runner_data"])
        if self._search_alg.has_checkpoint(self._local_checkpoint_dir):
            self._search_alg.restore_from_dir(self._local_checkpoint_dir)

You can find it at lines 310-312 of tune/trial_runner.py.
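To make the suspicion concrete, here is a toy sketch in plain Python. It is not Ray's code. `CountingGenerator`, its `save_to_dir` method, and the file name are hypothetical; only the `has_checkpoint` / `restore_from_dir` call pattern mirrors the snippet above. It assumes the search algorithm keeps a counter of how many samples it has handed out, and shows that the counter survives a restart only when the algorithm writes its own checkpoint. The direction of the miscount in real Ray may differ; the toy only shows that the counter is lost.

```python
import json
import os
import tempfile


class CountingGenerator:
    """Toy stand-in for a search algorithm that hands out num_samples configs."""

    CKPT_NAME = "toy_searcher_state.json"  # hypothetical file name

    def __init__(self, num_samples, checkpointable):
        self.num_samples = num_samples
        self.checkpointable = checkpointable
        self.counter = 0  # number of configurations handed out so far

    def next_trial(self):
        # Return a new toy config, or None once the sample budget is used up.
        if self.counter >= self.num_samples:
            return None
        self.counter += 1
        return {"trial_index": self.counter}

    def save_to_dir(self, dirpath):
        # A non-checkpointable generator silently writes nothing.
        if not self.checkpointable:
            return
        with open(os.path.join(dirpath, self.CKPT_NAME), "w") as f:
            json.dump({"counter": self.counter}, f)

    def has_checkpoint(self, dirpath):
        path = os.path.join(dirpath, self.CKPT_NAME)
        return self.checkpointable and os.path.exists(path)

    def restore_from_dir(self, dirpath):
        with open(os.path.join(dirpath, self.CKPT_NAME)) as f:
            self.counter = json.load(f)["counter"]


def remaining_after_resume(checkpointable, num_samples=4, run_before_stop=3):
    """Hand out some samples, 'restart', and count how many are still generated."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        first = CountingGenerator(num_samples, checkpointable)
        for _ in range(run_before_stop):
            first.next_trial()
        first.save_to_dir(ckpt_dir)

        # Simulated restart: a fresh object, restored only if a checkpoint exists.
        resumed = CountingGenerator(num_samples, checkpointable)
        if resumed.has_checkpoint(ckpt_dir):
            resumed.restore_from_dir(ckpt_dir)

        remaining = 0
        while resumed.next_trial() is not None:
            remaining += 1
        return remaining


if __name__ == "__main__":
    print("with searcher checkpoint:", remaining_after_resume(True))
    print("without searcher checkpoint:", remaining_after_resume(False))
```

With the default arguments, the checkpointable generator continues from where it stopped, while the other one starts counting from zero again, because the `has_checkpoint` branch is never taken for it.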

rliaw · February 9, 2021, 4:35am

This should be fixed in the nightly version of Ray. See: Installing Ray — Ray v2.0.0.dev0

EscapeReality8460894 · February 9, 2021, 4:58am

Thanks for the solution.
I’ve tried it, and this version of Ray does fix the problem. I looked at the source code and found the fix similar to the change I had imagined. I’m really sad to have missed a PR chance (just kidding).
Finally, thank you again. I look forward to the new release of Ray.

EscapeReality8460894 · February 10, 2021, 9:13am

Hmm, I’m puzzled by another question.
Although I can now resume a tune task that was shut down earlier, I can’t understand the number of trials I get:

I set num_samples=2 and shut down the tune run after 1 trial. When I resumed the same run, it found the one trial whose status was terminated. But the resumed process still ran 2 trials, so the total number of trials was 3. Is there a concept or feature that I’m missing?
Do I have to remove the old checkpoint files created before the last shutdown from disk before using resume?

rliaw · February 10, 2021, 9:32am

Hmm, seems like a bug; could you show me how to repro this (perhaps with a script, posted as a GitHub issue)?

EscapeReality8460894 · February 10, 2021, 10:32am

The issue has been created here

EscapeReality8460894 · February 12, 2021, 12:03pm

Is this issue actually a bug? Should I wait for a fix?
