FileNotFoundError when resuming from Checkpoint

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Dear all,

I have an RL program using RLlib and Tune, and it is starting to work quite well.
Now, I have an issue regarding checkpointing and resuming after interruption.

I run Tune using this config (among other things):

tune.run(
  resume="AUTO",
  checkpoint_freq=1,
  keep_checkpoints_num=5,
  checkpoint_score_attr="episode_reward_mean")

When my container gets stopped and I want to resume, it fails with this error:

2022-08-06 00:41:40,753	INFO trial_runner.py:515 -- A local experiment checkpoint was found and will be used to restore the previous experiment state.
2022-08-06 00:41:40,936	WARNING trial_runner.py:644 -- Attempting to resume experiment from /mnt/azureml/cr/j/e0c99e57a6ba4186be930b3053dd2219_2/cap/data-capability/wd/MODEL_OUTPUT/RayTesting. This will ignore any new changes to the specification.
2022-08-06 00:41:41,144	INFO tune.py:647 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
2022-08-06 00:41:41,321	ERROR ray_trial_executor.py:533 -- Trial SimpleQTrainer_RLlib_Gym_Environment_27871_00004: Unexpected error starting runner.

Traceback (most recent call last):
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 526, in start_trial
    return self._start_trial(trial)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 433, in _start_trial
    self.restore(trial)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 744, in restore
    obj = TrainableUtil.checkpoint_to_object(value)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 103, in checkpoint_to_object
    data_dict = TrainableUtil.pickle_checkpoint(checkpoint_path)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 83, in pickle_checkpoint
    checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 118, in find_checkpoint_dir
    raise FileNotFoundError("Path does not exist", checkpoint_path)
FileNotFoundError: [Errno Path does not exist] /mnt/azureml/cr/j/e0c99e57a6ba4186be930b3053dd2219/cap/data-capability/wd/MODEL_OUTPUT/RayTesting/27871_00004/checkpoint_000119/checkpoint-119

Traceback (most recent call last):
  File ".source_code/Main_RLlib.py", line 248, in <module>
    analysis = tune_run()
  File ".source_code/Main_RLlib.py", line 224, in tune_run
    analysis = tune.run(SimpleQTrainer,  # SimpleQTrainer | PGTrainer
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/tune.py", line 741, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [SimpleQTrainer_RLlib_Gym_Environment_27871_00004, SimpleQTrainer_RLlib_Gym_Environment_27871_00002, SimpleQTrainer_RLlib_Gym_Environment_27871_00000, SimpleQTrainer_RLlib_Gym_Environment_27871_00005, SimpleQTrainer_RLlib_Gym_Environment_27871_00001, SimpleQTrainer_RLlib_Gym_Environment_27871_00003])

(I cut out the irrelevant parts)

I ran the above code on the Azure Machine Learning service, but I can reproduce it locally when using the same Docker image.

When I run it locally in the Docker container, these are my observations:

  1. If I kill the Docker container before more than 5 iterations have finished and then restart the container, resume always works, although sometimes Tune starts from the beginning!
  2. Suppose I let a trial run for 10 iterations. If I kill the Docker container slowly, at least 5 seconds after the 10th iteration finished (based on a print from the log_trial_result callback; see the sketch after this list), and then restart the container, resume works as expected and the first iteration printed is 11.
  3. Suppose I let a trial run for 10 iterations. If I kill the Docker container quickly, at most 1 second after the 10th iteration finished (again based on the callback print), and then restart the container, resume fails with the above error. It tries to resume from the 9th iteration, which (usually) has already been deleted. If I inspect the output folder, I can see that the checkpoint folder checkpoint_000010 exists, while the 9th was already deleted because of keep_checkpoints_num=5.
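The iteration print I refer to comes from a logger callback. A minimal sketch of such a callback, assuming Ray 1.13's LoggerCallback interface (my real callback does more; this only shows the print used to time the kill):

from ray.tune.logger import LoggerCallback

class PrintIterationCallback(LoggerCallback):
    """Print each finished iteration so the moment to kill the container can be timed."""

    def log_trial_result(self, iteration, trial, result):
        print(f"{trial}: finished training_iteration {result['training_iteration']}")

# Passed to Tune via tune.run(..., callbacks=[PrintIterationCallback()]).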

The issue does not occur when using keep_checkpoints_num=None. For this test, a checkpoint was smaller than 1 MB. However, we are planning to store more things, and when doing serious training we will run into storage issues with keep_checkpoints_num=None, which will become difficult to manage.

Somehow, the presence of the checkpoint_000010 folder does not mean Tune wants to use it for resuming? And the absence of the checkpoint_000009 folder does not stop Tune from trying to resume from it?
So how does Tune determine which checkpoint to use when resuming? Doesn't it look at these folders, and is this information stored somewhere else?

Based on my reproduction analysis above, this issue seems to have a small probability of occurring: the Docker container has to be killed within 1 second after an iteration finishes. However, on Azure, resuming seems to almost always fail with this exact error. Debugging this on Azure is very slow and cumbersome, so it is difficult to collect more information.

My guess is that the storage on Azure is much slower because of (network) bandwidth limitations. Therefore, there is a much longer window in which the most recent checkpoint is still being written while the previous checkpoint has already been deleted (i.e., a larger window for this race condition).
This is just a guess, and I cannot back it up right now. Maybe with some pointers to the Tune source code we can confirm or disprove it.

Question:
Am I doing something wrong? Or is this a bug that I should report on GitHub? Or is this a known issue (I cannot find anything about it)?

Any insight into how the checkpointing mechanism works in this regard would help with investigating and solving this further.

One quick fix could be (I have not tested it) to also keep the last x checkpoints in addition to the 5 best checkpoints that are kept right now. Then Tune can always resume from the checkpoint before the last one.
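A sketch of one way to approximate this with the existing options (untested on my side): score the checkpoints by training_iteration instead of by reward, so that the kept checkpoints are always the most recent ones and a state that is stale by one iteration can still find its checkpoint. The environment/config below are placeholders:

from ray import tune
from ray.rllib.agents.dqn import SimpleQTrainer

tune.run(
    SimpleQTrainer,
    config={"env": "CartPole-v0"},  # placeholder environment/config
    resume="AUTO",
    checkpoint_freq=1,
    keep_checkpoints_num=5,
    # Scoring by training_iteration makes "best" mean "most recent", so the
    # 5 newest checkpoints are kept instead of the 5 best-by-reward ones.
    checkpoint_score_attr="training_iteration")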

Details:
I use Ray 1.13.0

1 Like

@Stefan-1313
Thanks for the detailed description. What you described is not expected to happen. Feel free to file a bug and assign it to me.
Can you reproduce it 100% of the time? At first glance, I agree with you that a race condition is unlikely - we are talking about a buffer of 5 checkpoints here.
I would be curious to debug this further with a minimal repro script from you, if possible.
For deciding which checkpoints to use for each trial: Tune bookkeeps a list of the best checkpoint paths for each trial (5 in this case). When the program is terminated, that information is kept in some files under the experiment/ folder. When tuning is resumed, that information is loaded (using the name of the experiment for the lookup), and then each trial is restored from its latest checkpoint.
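As a quick sanity check, you can also compare what is on disk with the path in the error message. A rough sketch (the experiment folder below is a placeholder) that lists the experiment state files and the checkpoint folders Tune writes:

import glob
import os

# Placeholder: the experiment folder used for resume (i.e. local_dir/name).
exp_dir = os.path.expanduser("~/ray_results/RayTesting")

# The experiment-level state that resume="AUTO" restores from is stored in
# experiment_state-*.json files directly under this folder.
for state_file in sorted(glob.glob(os.path.join(exp_dir, "experiment_state-*.json"))):
    print("experiment state:", state_file)

# The per-trial checkpoints live in checkpoint_* folders inside each trial folder.
for ckpt_dir in sorted(glob.glob(os.path.join(exp_dir, "*", "checkpoint_*"))):
    print("checkpoint on disk:", ckpt_dir)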

1 Like

Hi @xwjiang2010 ,

Thanks for looking into this! If the conditions for the scenarios (stated in my post) are met, I can reproduce this 100% of the time during testing.

The only exceptions are when Tune wants to resume from an unexpected checkpoint (the issue), but keep_checkpoints_num happened to keep that checkpoint as well; that is just luck, with a larger probability for short tests since not many checkpoints have been deleted yet. For the first scenario, where sometimes "Tune finds a checkpoint but starts from the beginning anyway", I think the cause is the same issue; starting from the beginning is always possible, so there is no crash (but also no resume as expected).

Right now I don't have a minimal working example, but I will make one, see if it reproduces there, and then post it on GitHub. I don't have permission to assign issues to anyone, but I can tag you. Is your GitHub username the same as on this discussion forum (I checked and I believe so)?

My code uses a custom environment, model and exploration class but nothing special regarding checkpointing.

1 Like

Thank you.
So this is the test script I use to test the checkpoint and resume functionality. It works fine for me:

from ray import tune
import time
import os, json

class MyTrainable(tune.Trainable):

    def setup(self, config):
        self.timestep = 0
        self.is_restored = False

    def step(self):
        time.sleep(1)
        self.timestep += 1

        # Report the timestep as the metric; it is also the attribute used
        # for checkpoint scoring via checkpoint_score_attr below.
        return {"my_metrics": self.timestep}

    def save_checkpoint(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    def load_checkpoint(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]

tune.run(
    MyTrainable,
    name="my_run",
    resume="AUTO",
    checkpoint_freq=1,
    keep_checkpoints_num=5,
    checkpoint_score_attr="my_metrics")

1 Like

Hi @xwjiang2010 ,

Thanks for looking into it! You are slightly too quick, as I was still working on the GitHub issue.

I made a minimal working example without any custom components, only vanilla Ray building blocks and a vanilla Gym environment (LunarLander).
The example also reproduces the issue 100% of the time, given the specific timing requirement described in my reproduction steps.

At first, the minimal working example would not reproduce the issue. But then I increased the network update time, so that I actually had time to wait without a new iteration finishing in the meantime, which would interfere with the reproduction steps.
After this change, it started to reproduce consistently. I tried to make this detail as clear as possible in the GitHub issue as well.

This is the link to the GitHub issue; I also mentioned you there, so you can assign it to yourself in case you want to (I don't think I can): GitHub issue.

EDIT:
I just looked at your minimal working example and it looks good (thanks!).
These are the differences I notice with respect to my minimal working example:

  1. You use a custom Trainable, while I use SimpleQTrainer; maybe the issue only occurs for SimpleQTrainer?

  2. Also, it seems from your terminal output that you give Python time to stop gracefully. I had to force-stop Python for the issue to reproduce. When stopping gracefully, Tune somehow gets the opportunity it needs to wrap up cleanly (flush buffers?). In PyCharm, I can force-stop by hitting the red stop button quickly multiple times in a row (pressing it just once will gracefully stop the Python process, and the issue will not reproduce). The issue reproduces only when force-stopping the Python process. I believe Docker also just kills everything inside it, including the Python process, when the container is stopped.
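For completeness, instead of PyCharm's stop button, the force stop can probably also be simulated with a small driver script that hard-kills the training process. This is only a rough, untested sketch: it uses my script name from the traceback, the sleep has to be timed to land just after an iteration finishes, and leftover Ray worker processes may need a separate `ray stop`:

import subprocess
import time

# Start the training script as a child process.
proc = subprocess.Popen(["python", "Main_RLlib.py"])

# Hard-kill it (SIGKILL on Linux): no graceful Python shutdown, which is close to
# what happens when the Docker container is killed. In practice the kill has to
# land within about 1 second after an iteration finishes for the issue to show up.
time.sleep(60)
proc.kill()
proc.wait()

# Run the script again; with resume="AUTO" it should try to resume the experiment.
subprocess.run(["python", "Main_RLlib.py"])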