Ray Tune stores absolute paths in checkpoints and cannot resume if checkpoints are moved

Dear all,

The Ray Tune ‘local_dir’ contains two *.json files which save the experiment config that is used when resuming from a checkpoint. I noticed these files also store the ‘local_dir’, and other locations derived from it, as absolute paths.

Unfortunately, if you move the checkpoints to another location (or the location holding the checkpoints is mounted at a different path), Tune fails to resume.

To more clearly describe what is happening:

  1. I submit an experiment (and Docker image) to my cloud provider service (Azure Machine Learning).
  2. Azure starts the Docker container and mounts the output storage to a path it likes. This path is provided to me as an environment variable and as an input argument to my Python script.
  3. The experiment runs fine for a few hours and creates a checkpoint every 10 minutes.
  4. We use discounted low-priority nodes, so Azure stops our Docker container when computation capacity is limited.
  5. When sufficient computational capacity is available again, Azure restarts our Docker container and mounts the output storage to a path it likes.
  6. Recently, Azure changed something and now the mount path is different after a restart :frowning: . No worries, as this new path is provided to me as an environment variable and as an input argument to my Python script. All my previously written files and Tune checkpoints are accessible via this path :slight_smile: .
  7. I give Tune my new local_dir, and Tune finds the checkpoints (e.g. checkpoint_001700). However, directly afterwards I get FileNotFoundError: [Errno Path does not exist] (I checked, and checkpoint_001700 does exist). This makes sense, because Tune uses the path read from the checkpoint information, which is now a dead end.

For the long term, it may be better for Tune to only store paths relative to the local_dir, so if the contents of the local_dir get (virtually) moved, the checkpoints remain valid and usable.
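To illustrate the idea of relative paths, here is a minimal sketch. It is not how Tune stores its state; the helper names `to_relative` and `to_absolute` and the shortened example paths are hypothetical. The point is only that a path stored relative to the local_dir can be resolved against whatever local_dir is passed in after a restart.

```python
import os


def to_relative(checkpoint_path: str, local_dir: str) -> str:
    """Return checkpoint_path expressed relative to local_dir (what would be stored)."""
    return os.path.relpath(checkpoint_path, start=local_dir)


def to_absolute(relative_path: str, local_dir: str) -> str:
    """Resolve a stored relative path against the local_dir of the current run."""
    return os.path.normpath(os.path.join(local_dir, relative_path))


# Hypothetical, shortened mount points standing in for the Azure paths.
old_local_dir = "/mnt/job/RayTesting"
new_local_dir = "/mnt/job_2/RayTesting"

stored = to_relative(
    old_local_dir + "/bcc35_00000/checkpoint_001700/checkpoint-1700",
    old_local_dir,
)
# After the restart, the same stored value resolves under the new mount point.
resolved = to_absolute(stored, new_local_dir)
```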

But for the short term :slight_smile: , is there any way to get around this? So I can resume my runs on Azure, even if the new Azure runtime changes the mounting point of my output storage on a restart?

Thank you!
If you want, I can file a GitHub issue about this :slight_smile: .

For completeness, the relevant part of the output of my run:

2022-08-21 17:58:16,541	INFO trial_runner.py:515 -- A local experiment checkpoint was found and will be used to restore the previous experiment state.

# => Notice the '_2' suffix appended to the directory name: '4988b8e5bbe04c25add82f41f07603b4'. Azure makes this change after a restart of the container; after every restart, this number gets incremented.
2022-08-21 17:58:16,754	WARNING trial_runner.py:644 -- Attempting to resume experiment from /mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4_2/cap/data-capability/wd/MODEL_OUTPUT/RayTesting. This will ignore any new changes to the specification.

2022-08-21 17:58:16,815	INFO tune.py:647 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
2022-08-21 17:58:16,973	ERROR ray_trial_executor.py:533 -- Trial RLlib_Gym_Environment_bcc35_00000: Unexpected error starting runner.
Traceback (most recent call last):
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 526, in start_trial
    return self._start_trial(trial)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 433, in _start_trial
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 744, in restore
    logger.debug("Trial %s: Reading checkpoint into memory", trial)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 103, in checkpoint_to_object
    data_dict = TrainableUtil.pickle_checkpoint(checkpoint_path)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 83, in pickle_checkpoint
    checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
  File "/azureml-envs/azureml_3b2b862a3840fdeb7e2b03f15737dac0/lib/python3.9/site-packages/ray/tune/utils/trainable.py", line 118, in find_checkpoint_dir
    raise FileNotFoundError("Path does not exist", checkpoint_path)

# => Notice the '_2' is not present in the directory name : '4988b8e5bbe04c25add82f41f07603b4'. This is how the run started initially before a restart. This 'wrong / outdated' path comes from one of the Ray checkpoint files.
FileNotFoundError: [Errno Path does not exist] /mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4/cap/data-capability/wd/MODEL_OUTPUT/RayTesting/bcc35_00000/checkpoint_001700/checkpoint-1700

I see. Thank you for reporting this. We should definitely fix this to make it smoother! This is a great issue; could you file a GitHub issue so that I can track it?
In the meantime, how about you just replace the old path strings in those two .json files with the new mount path?
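A rough sketch of that string replacement, under some assumptions: the helper name `rewrite_prefix` is made up for illustration, the files are assumed to be readable as UTF-8 text, and only the *.json files directly inside the given directory are touched. Paths that are stored in an encoded (non-plain-text) form inside those files would not be updated by this. Pass both prefixes with a trailing slash: here the new directory name starts with the old one, so without the slash a second run would rewrite already-updated paths again.

```python
from pathlib import Path


def rewrite_prefix(experiment_dir: str, old_prefix: str, new_prefix: str) -> list:
    """Replace old_prefix with new_prefix in each *.json file in experiment_dir.

    Hypothetical helper, not part of Ray. Writes a '.bak' copy of every file
    it changes and returns the names of the changed files.
    """
    changed = []
    for json_file in sorted(Path(experiment_dir).glob("*.json")):
        text = json_file.read_text(encoding="utf-8")
        if old_prefix not in text:
            continue  # nothing to rewrite in this file
        # Keep the original next to the rewritten file, in case something goes wrong.
        json_file.with_name(json_file.name + ".bak").write_text(text, encoding="utf-8")
        json_file.write_text(text.replace(old_prefix, new_prefix), encoding="utf-8")
        changed.append(json_file.name)
    return changed


# Example call (run before handing the new local_dir to Tune):
# rewrite_prefix(
#     "/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4_2/cap/data-capability/wd/MODEL_OUTPUT/RayTesting",
#     "/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4/",
#     "/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4_2/",
# )
```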


also cc @kai for visibility.

@xwjiang2010 thank you!

I will put the above issue also on GitHub.

I found that replacing all the strings in the two *.json files would be a bit tricky, mostly because of my limited understanding :wink: . For example, when multiple files are present, I am not sure which one to edit. There also seems to be some non-text content in one of the json files. But I think in the end I could make this work.

For now I found it easier to create symlinks from (all) the old paths to the new location. Locally it works, but to know whether it works on Azure I have to wait until I get a restart (sometimes it takes days).
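For reference, the symlink workaround can be sketched roughly as below. The helper name `link_old_to_new` is made up for this post, and it assumes the process is allowed to create directories and symlinks under the parent of the old path, which may not hold on every Azure setup.

```python
import os


def link_old_to_new(old_path: str, new_path: str) -> bool:
    """Make old_path a symlink to new_path so stale absolute paths resolve again.

    Hypothetical helper. Returns True if the link was created, False if
    something already exists at old_path (it is then left untouched).
    """
    old_path = os.path.normpath(old_path)  # drop any trailing slash
    if os.path.lexists(old_path):
        return False
    # The parent of the old mount point may no longer exist after a restart.
    os.makedirs(os.path.dirname(old_path), exist_ok=True)
    os.symlink(new_path, old_path, target_is_directory=True)
    return True


# Example call, using the two mount points from the log output above:
# link_old_to_new(
#     "/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4",
#     "/mnt/azureml/cr/j/4988b8e5bbe04c25add82f41f07603b4_2",
# )
```

If there have been several restarts, one such call per previous mount path would be needed, since the checkpoint files may reference any of them.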
