Dear all,
The Ray Tune ‘local_dir’ contains two *.json files that store the experiment config used when resuming from a checkpoint. I noticed these files also store the ‘local_dir’, and other locations derived from it, as absolute paths.
Unfortunately, if the checkpoints are moved to another location (or the location holding them is mounted at a different path), Tune fails to resume.
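For example, a quick scan along these lines shows where the old absolute paths are baked in (the environment variable name and the *.json glob are just my assumptions about where the paths live, so adjust them for your setup):

```python
import glob
import os

# Hypothetical: the local_dir as handed to me by Azure ML via an environment variable.
local_dir = os.environ["MODEL_OUTPUT_DIR"]

# The absolute mount prefix Azure used when the experiment was first started.
old_mount_prefix = "/mnt/azureml/cr/j/"

# Scan the *.json files Tune writes into local_dir and report how often the old
# absolute prefix is baked into each of them.
for json_file in glob.glob(os.path.join(local_dir, "*.json")):
    with open(json_file) as f:
        contents = f.read()
    print(os.path.basename(json_file), "->", contents.count(old_mount_prefix), "absolute path(s)")
```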
To more clearly describe what is happening:
- I submit an experiment (and Docker image) to my cloud provider service (Azure Machine Learning).
- Azure starts the Docker container and mounts the output storage to a path it likes. This path is provided to me as an environment variable and as an input argument to my Python script.
- The experiment runs fine for a few hours and creates a checkpoint every 10 minutes.
- Because we use discounted low-priority nodes, Azure stops our Docker container when computational capacity becomes limited.
- When sufficient computational capacity is available again, Azure restarts our Docker container and mounts the output storage to a path it likes.
- Recently, Azure changed something, and the mount path is now different after a restart. No worries, as this new path is provided to me as an environment variable and as an input argument to my Python script. All my previously written files and Tune checkpoints are accessible via this path.
- I give Tune my new local_dir, and Tune finds the checkpoints (e.g. checkpoint_001700). However, right after that I get the FileNotFoundError: [Errno Path does not exist] (I checked, and checkpoint_001700 does exist). This makes sense, as Tune starts using the path read from the checkpoint information, which is now a dead end.
For the long term, it may be better for Tune to only store paths relative to the local_dir, so that if the contents of the local_dir get (virtually) moved, the checkpoints remain valid and usable.
But for the short term, is there any way to get around this, so that I can resume my runs on Azure even if the new Azure runtime changes the mount point of my output storage on a restart?
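As a stopgap, I was thinking of rewriting the stored paths myself before calling tune.run(..., resume=True), roughly like the sketch below. The plain string replacement over the *.json files is only my guess at where the paths live (it would miss paths stored in any non-JSON metadata), so I would love to hear if there is a supported way instead.

```python
import glob
import os


def rewrite_stored_paths(local_dir: str, old_prefix: str, new_prefix: str) -> None:
    """Replace the old absolute mount prefix with the new one in the *.json files
    Tune keeps under local_dir. Home-grown workaround, not an official Tune API,
    and it only touches JSON files."""
    for json_file in glob.glob(os.path.join(local_dir, "**", "*.json"), recursive=True):
        with open(json_file, "r") as f:
            contents = f.read()
        if old_prefix in contents:
            with open(json_file, "w") as f:
                f.write(contents.replace(old_prefix, new_prefix))


# The two prefixes below are taken from my logs: the original mount path and the
# '_2'-suffixed path Azure uses after the restart.
rewrite_stored_paths(
    local_dir="/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4_2/cap/data-capability/wd/MODEL_OUTPUT/RayTesting",
    old_prefix="/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4/",
    new_prefix="/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4_2/",
)
```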
Thank you!
If you want, I can file a GitHub issue about this.
For completeness, the relevant part of the output of my run:
2022-08-21 17:58:16,541 INFO trial_runner.py:515 -- A local experiment checkpoint was found and will be used to restore the previous experiment state.
# => Notice the '_2' suffixed to the directory name '4988b8e5bbe04c25add82f41f07603b4'. Azure makes this change after a restart of the container; after every restart, this number gets incremented.
2022-08-21 17:58:16,754 WARNING trial_runner.py:644 -- Attempting to resume experiment from /mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4_2/cap/data-capability/wd/MODEL_OUTPUT/RayTesting. This will ignore any new changes to the specification.
2022-08-21 17:58:16,815 INFO tune.py:647 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
2022-08-21 17:58:16,973 ERROR ray_trial_executor.py:533 -- Trial RLlib_Gym_Environment_bcc35_00000: Unexpected error starting runner.
Traceback (most recent call last):
File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 526, in start_trial
return self._start_trial(trial)
File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 433, in _start_trial
self.restore(trial)
File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 744, in restore
logger.debug("Trial %s: Reading checkpoint into memory", trial)
File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 103, in checkpoint_to_object
data_dict = TrainableUtil.pickle_checkpoint(checkpoint_path)
File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 83, in pickle_checkpoint
checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 118, in find_checkpoint_dir
raise FileNotFoundError("Path does not exist", checkpoint_path)
# => Notice the '_2' is not present in the directory name '4988b8e5bbe04c25add82f41f07603b4'. This is how the run started initially, before a restart. This 'wrong / outdated' path comes from one of the Ray checkpoint files.
FileNotFoundError: [Errno Path does not exist] /mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4/cap/data-capability/wd/MODEL_OUTPUT/RayTesting/bcc35_00000/checkpoint_001700/checkpoint-1700