I’ve been trying to resume a broken tune run with no success. I get the following message:
2023-09-08 23:27:10,732 INFO experiment_state.py:388 -- Trying to find and download experiment checkpoint at gs://XXXXXXXXXXXXX
2023-09-08 23:28:20,726 INFO experiment_state.py:424 -- A remote experiment checkpoint was found and will be used to restore the previous experiment state.
2023-09-08 23:28:20,730 WARNING trial_runner.py:418 -- Attempting to resume experiment from XXXXXXXX. This will ignore any new changes to the specification.
2023-09-08 23:28:20,730 INFO trial_runner.py:422 -- Using the newest experiment state file found within the experiment directory: experiment_state-2023-09-08_23-01-35.json
But then, eventually, I get:
  File "/home/ray/.local/lib/python3.9/site-packages/ray/tune/tune.py", line 1130, in run
    ea = ExperimentAnalysis(
  File "/home/ray/.local/lib/python3.9/site-packages/ray/tune/analysis/experiment_analysis.py", line 113, in __init__
    assert self._checkpoints_and_paths
AssertionError
I use the functional API to run the experiment (tune.run), and I have tried all relevant variations of the resume parameter (AUTO, REMOTE (I sync the experiment to GCS buckets), ERRORED, ERRORED_ONLY, etc.). Also, I do not save checkpoints during the tuning process, so I would be fine with restarting every trial that errored (as long as it has not reached max_failures) or has not started yet.
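
For context, here is a minimal sketch of how I invoke tune.run. The trainable, experiment name, config, and num_samples are placeholders (my real training function is more involved), and the bucket path is redacted as in the logs above:

```python
from ray import tune

def trainable(config):
    # Placeholder objective; the real training function is not shown here.
    tune.report(loss=config["x"] ** 2)

analysis = tune.run(
    trainable,
    name="my_experiment",                 # hypothetical experiment name
    config={"x": tune.uniform(-1.0, 1.0)},
    num_samples=100,
    max_failures=3,                       # retry budget per trial before it counts as errored
    sync_config=tune.SyncConfig(
        upload_dir="gs://XXXXXXXXXXXXX",  # redacted bucket, same one the logs refer to
    ),
    resume="AUTO",                        # also tried "REMOTE", "ERRORED", "ERRORED_ONLY"
)
```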
Anything obvious that I am missing?