- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Dear all,
I have an RL program using RLlib and Tune, and it is starting to work quite well.
Now, I have an issue regarding checkpointing and resuming after interruption.
I run Tune using this config (among other things):
```python
tune.run(
    resume="AUTO",
    checkpoint_freq=1,
    keep_checkpoints_num=5,
    checkpoint_score_attr="episode_reward_mean")
```
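For context, the surrounding call looks roughly like the sketch below; the trainable import path, experiment name, `local_dir`, environment, and stopping criterion are simplified placeholders rather than my exact values:

```python
from ray import tune
from ray.rllib.agents.dqn import SimpleQTrainer  # Ray 1.13-style RLlib import

analysis = tune.run(
    SimpleQTrainer,
    name="RayTesting",                        # placeholder experiment name
    local_dir="/path/to/MODEL_OUTPUT",        # placeholder; on Azure this is a mounted output path
    config={"env": "CartPole-v1"},            # placeholder; my real config registers a custom env
    stop={"training_iteration": 200},         # placeholder stopping criterion
    resume="AUTO",
    checkpoint_freq=1,
    keep_checkpoints_num=5,
    checkpoint_score_attr="episode_reward_mean",
)
```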
When my container gets stopped and I want to resume, it fails with this error:
```
2022-08-06 00:41:40,753 INFO trial_runner.py:515 -- A local experiment checkpoint was found and will be used to restore the previous experiment state.
2022-08-06 00:41:40,936 WARNING trial_runner.py:644 -- Attempting to resume experiment from /mnt/azureml/cr/j/e0c99e57a6ba4186be930b3053dd2219_2/cap/data-capability/wd/MODEL_OUTPUT/RayTesting. This will ignore any new changes to the specification.
2022-08-06 00:41:41,144 INFO tune.py:647 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
2022-08-06 00:41:41,321 ERROR ray_trial_executor.py:533 -- Trial SimpleQTrainer_RLlib_Gym_Environment_27871_00004: Unexpected error starting runner.
Traceback (most recent call last):
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 526, in start_trial
    return self._start_trial(trial)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 433, in _start_trial
    self.restore(trial)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 744, in restore
    obj = TrainableUtil.checkpoint_to_object(value)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 103, in checkpoint_to_object
    data_dict = TrainableUtil.pickle_checkpoint(checkpoint_path)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 83, in pickle_checkpoint
    checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 118, in find_checkpoint_dir
    raise FileNotFoundError("Path does not exist", checkpoint_path)
FileNotFoundError: [Errno Path does not exist] /mnt/azureml/cr/j/e0c99e57a6ba4186be930b3053dd2219/cap/data-capability/wd/MODEL_OUTPUT/RayTesting/27871_00004/checkpoint_000119/checkpoint-119
Traceback (most recent call last):
  File ".source_code/Main_RLlib.py", line 248, in <module>
    analysis = tune_run()
  File ".source_code/Main_RLlib.py", line 224, in tune_run
    analysis = tune.run(SimpleQTrainer, # SimpleQTrainer | PGTrainer
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/tune.py", line 741, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [SimpleQTrainer_RLlib_Gym_Environment_27871_00004, SimpleQTrainer_RLlib_Gym_Environment_27871_00002, SimpleQTrainer_RLlib_Gym_Environment_27871_00000, SimpleQTrainer_RLlib_Gym_Environment_27871_00005, SimpleQTrainer_RLlib_Gym_Environment_27871_00001, SimpleQTrainer_RLlib_Gym_Environment_27871_00003])
```
(I cut out the irrelevant parts)
I ran the above code on the Azure Machine Learning service, but I can also reproduce the problem locally with the same Docker image.
When I start the Docker container locally, I observe the following:
- If I kill the Docker container before more than 5 iterations have completed and then restart the container, resume always works, although sometimes Tune starts from the beginning!
- Suppose I let a trial run for 10 iterations. If I kill the Docker container slowly, at least 5 seconds after the 10th iteration finished, and then restart the container, resume works as expected and the first iteration printed is 11 (based on a print from the `log_trial_result` callback; see the sketch after this list).
- Suppose I let a trial run for 10 iterations. If I kill the Docker container quickly, at most 1 second after the 10th iteration finished (again judged by the `log_trial_result` print), and then restart the container, resume fails with the above error. It tries to resume from the 9th iteration, which (usually) has already been deleted. If I inspect the output folder, I can see that the `checkpoint_000010` folder exists, while the 9th was already deleted because of `keep_checkpoints_num=5`.
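For reference, the iteration prints mentioned above come from a logger callback roughly like this minimal sketch (my real callback does more than printing):

```python
from ray.tune.logger import LoggerCallback


class PrintIterationCallback(LoggerCallback):
    """Prints the training iteration of every result a trial reports."""

    def log_trial_result(self, iteration, trial, result):
        # After a resume, the first printed training_iteration tells me
        # which checkpoint Tune actually restored from.
        print(f"{trial}: finished training_iteration {result['training_iteration']}")
```

It is passed to Tune via `tune.run(..., callbacks=[PrintIterationCallback()])`.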
The issue does not occur when using `keep_checkpoints_num=None`. For this test, the checkpoint size was less than 1 MB. However, we are planning to store more things, and when doing serious training we will run into storage issues with `keep_checkpoints_num=None`, which will become difficult to manage.
Somehow, the presence of the `checkpoint_000010` folder does not mean Tune wants to use it for resuming? And the absence of the `checkpoint_000009` folder does not stop Tune from trying to resume from it?
So how does Tune determine which checkpoint to use when resuming? Doesn't it look at these folders, or is this information stored somewhere else?
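To poke at this myself, I was planning to dump whatever the experiment checkpoint records, roughly as in the sketch below. I am assuming the experiment-level state lives in the `experiment_state-*.json` files at the top of the experiment directory; I do not know their exact schema, so the script only prints the keys and greps for checkpoint paths (`EXP_DIR` is a placeholder):

```python
import glob
import json
import os

EXP_DIR = "/path/to/MODEL_OUTPUT/RayTesting"  # placeholder for my experiment directory

# resume="AUTO" should restore from the newest experiment_state-*.json file.
state_files = sorted(
    glob.glob(os.path.join(EXP_DIR, "experiment_state-*.json")), key=os.path.getmtime
)
latest = state_files[-1]
print("Newest experiment state file:", latest)

with open(latest) as f:
    state = json.load(f)

# Inspect the structure and look for recorded checkpoint paths per trial.
print("Top-level keys:", list(state.keys()))
for key, value in state.items():
    text = json.dumps(value)
    if "checkpoint" in text:
        print(f"\n--- {key} (truncated) ---\n{text[:500]}")
```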
Based on my reproduction analysis above, this issue seems to have a small probability of occurring: the Docker container has to be killed within 1 second after an iteration finishes. However, on Azure, resuming almost always fails with this exact error. Debugging this on Azure is very slow and cumbersome, so it is difficult to collect more information.
My guess is that storage on Azure is much slower because of (network) bandwidth limitations. That would leave a much longer window in which the most recent checkpoint is still being written while the previous checkpoint has already been deleted, i.e. a larger window for this race condition to hit.
This is just a guess, and I cannot back it up right now. Maybe with some pointers to the Tune source code we can confirm or disprove it.
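A cheap way to check this, under the assumption above, would be to compare filesystem timestamps of the remaining checkpoint folders with those of the experiment state files (paths below are placeholders):

```python
import glob
import os
import time

EXP_DIR = "/path/to/MODEL_OUTPUT/RayTesting"  # placeholder experiment directory

# If the newest checkpoint folder was finished *after* the last experiment-state save,
# the restored state would still point at an older (possibly already deleted) checkpoint,
# which is what the suspected race would look like.
paths = glob.glob(os.path.join(EXP_DIR, "*", "checkpoint_*")) + \
        glob.glob(os.path.join(EXP_DIR, "experiment_state-*.json"))
for path in sorted(paths, key=os.path.getmtime):
    print(time.ctime(os.path.getmtime(path)), path)
```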
Question:
Am I doing something wrong? Or is this a bug that I should report on GitHub? Or is this a known issue (I could not find anything about it)?
Any insight into how the checkpointing mechanism works in this respect would help in investigating and solving this further.
One quick fix could be (though I have not tested it) to also keep the last x checkpoints in addition to the 5 best checkpoints that are kept right now. Then Tune could always resume from the second-to-last checkpoint.
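Something along the lines of the untested sketch below is what I had in mind as a stopgap on my side: after every iteration, copy the newest checkpoint folder somewhere Tune will not delete it, so there is always a resumable copy (I am assuming the trial checkpoints are the `checkpoint_*` subfolders of the trial's log directory; `backup_latest_checkpoint` is a hypothetical helper, not a Tune API):

```python
import glob
import os
import shutil


def backup_latest_checkpoint(trial_dir: str, backup_dir: str) -> None:
    """Copy the newest checkpoint_* folder of a trial to a directory Tune does not manage."""
    checkpoints = sorted(glob.glob(os.path.join(trial_dir, "checkpoint_*")))
    if not checkpoints:
        return
    latest = checkpoints[-1]
    target = os.path.join(backup_dir, os.path.basename(latest))
    if not os.path.exists(target):
        # copytree fails if the target already exists, hence the check above.
        shutil.copytree(latest, target)
```

This could be called from the `log_trial_result` callback sketched earlier, using the trial's local log directory as `trial_dir`. Of course it would still race with a checkpoint that is only half written, so it is only a mitigation.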
Details:
I use Ray 1.13.0