I ran into this issue yesterday, and I have a little free time today, so I'd like to discuss it with you.
Yesterday I wanted to save my tune task, so I tried the resume mechanism mentioned in the manual. As the manual says, checkpoints are persisted at all times. But when I terminate the program and restart the tune task with resume set to true, it starts normally, yet the sampling count is not resumed: in resume mode, Ray just reruns the pending or running trials saved in the checkpoint. I found this happens because I was using BasicVariantGenerator, the default search algorithm. It doesn't implement the checkpointing mechanism, so during resume the runner's resume method skips restoring some of the search algorithm object's state (I mean BasicVariantGenerator's). I guessed that _iteration in the runner_data of the checkpoint file corresponds to the total number of samples, but that may not be the case, because _iteration is not always equal to the number of samples specified in tune.run.
To sum up, I can't cleanly terminate a tune task that uses BasicVariantGenerator and then resume it. I can avoid this by using a search algorithm with a checkpoint-saving mechanism, such as the HyperOptSearch I am now using, but is it reasonable that BasicVariantGenerator doesn't implement the checkpoint-saving mechanism?
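For reference, this is roughly how I start and later resume the run (a minimal sketch assuming the Ray 1.x Tune API with hyperopt installed; the trainable, experiment name, and search space are simplified placeholders, not my real ones):

from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch

def trainable(config):
    # placeholder objective: report a dummy loss for a few steps
    for _ in range(10):
        tune.report(loss=(config["lr"] - 0.05) ** 2)

tune.run(
    trainable,
    name="resume_demo",              # the same name is needed when resuming
    local_dir="~/ray_results",
    config={"lr": tune.uniform(0.001, 0.1)},
    num_samples=4,
    search_alg=HyperOptSearch(metric="loss", mode="min"),
    resume=True,   # False (or omitted) for the first run, True after a shutdown
)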
What version of Ray are you on? I think this should work on the next Ray release (coming this week).
Thanks for the reply. The version I'm using is v1.1.0, which I found is the latest release of Ray.
In addition, I think the problem is hidden in the code below:
self.__setstate__(runner_state["runner_data"])
if self._search_alg.has_checkpoint(self._local_checkpoint_dir):
    self._search_alg.restore_from_dir(self._local_checkpoint_dir)
You can find it at lines 310-312 of tune/trial_runner.py.
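For context, the search-algorithm checkpointing I'm referring to looks roughly like this pattern: searchers such as HyperOptSearch can persist and reload their internal state, which is what lets sampling continue after a restart. This is only a rough sketch of that pattern, not the exact code in trial_runner.py, and the path is a placeholder:

from ray.tune.suggest.hyperopt import HyperOptSearch

searcher = HyperOptSearch(metric="loss", mode="min")

# ... during a run, Tune asks the searcher for suggestions and reports results back ...

# a searcher that supports checkpointing can persist its internal state
searcher.save("/tmp/searcher_state.pkl")        # placeholder path

# on resume, a fresh searcher can load that state and continue where it left off
restored = HyperOptSearch(metric="loss", mode="min")
restored.restore("/tmp/searcher_state.pkl")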
This should be fixed in the nightly version of Ray: Installing Ray — Ray v2.0.0.dev0
Thanks for the solution.
I've tried it, and this version of Ray does fix the problem. I looked at the source code, and the fix is similar to what I had imagined. I'm a bit sad to have missed the chance to submit a PR, but that's just a joke.
Finally, thank you again, and I look forward to the brand new release of Ray.
Hmm, I'm puzzled by another question.
Although I can now resume a tune task that was shut down earlier, I can't understand why the number of trials can end up like this:
I set num_samples=2 and shut down the tune run after 1 trial. When I resumed the same run, it found the one trial whose status was terminated, but the resumed process still ran 2 more trials, so the total was 3 trials. Is there any concept or feature that I'm missing?
Do I have to remove the old checkpoint files (created before the last shutdown) from disk before using resume?
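Concretely, what I did is roughly the following (a minimal sketch with a placeholder trainable and experiment name; in my case the first run was interrupted manually after the first trial had finished):

from ray import tune

def trainable(config):
    # placeholder trainable that reports for a while so the run can be interrupted
    for _ in range(100):
        tune.report(score=config["x"])

# first run: interrupted (e.g. with Ctrl-C) after the first trial has terminated
tune.run(
    trainable,
    name="resume_count_demo",
    local_dir="~/ray_results",
    config={"x": tune.uniform(0, 1)},
    num_samples=2,
)

# second run: resumed from the same experiment directory
tune.run(
    trainable,
    name="resume_count_demo",
    local_dir="~/ray_results",
    config={"x": tune.uniform(0, 1)},
    num_samples=2,
    resume=True,  # finds the 1 terminated trial but still schedules 2 new ones, 3 in total
)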
Hmm, seems like a bug; could you show me how to repro this (perhaps with a script, posted on GitHub issues)?
The issue has been created here.
I'd like to know whether this issue is actually a bug. Should I wait for a fix?