I’ve been trying to resume a broken tune run with no success. I get the following message:
2023-09-08 23:27:10,732 INFO experiment_state.py:388 -- Trying to find and download experiment checkpoint at gs://XXXXXXXXXXXXX
2023-09-08 23:28:20,726 INFO experiment_state.py:424 -- A remote experiment checkpoint was found and will be used to restore the previous experiment state.
2023-09-08 23:28:20,730 WARNING trial_runner.py:418 -- Attempting to resume experiment from XXXXXXXX. This will ignore any new changes to the specification.
2023-09-08 23:28:20,730 INFO trial_runner.py:422 -- Using the newest experiment state file found within the experiment directory: experiment_state-2023-09-08_23-01-35.json
But then, eventually, I get:
  File "/home/ray/.local/lib/python3.9/site-packages/ray/tune/tune.py", line 1130, in run
    ea = ExperimentAnalysis(
  File "/home/ray/.local/lib/python3.9/site-packages/ray/tune/analysis/experiment_analysis.py", line 113, in __init__
    assert self._checkpoints_and_paths
AssertionError
I use the functional API to run the experiment (tune.run), and I have tried all relevant variations of the resume parameter (AUTO, REMOTE (I sync the experiment to GCS buckets), ERRORED, ERRORED_ONLY, etc.). Also, I do not save checkpoints during the tuning process, so I would be fine with restarting every trial that errored (as long as it has not reached max_failures) or has not started yet.
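
For context, here is a minimal sketch of how I invoke tune.run. The trainable, experiment name, config, and num_samples are placeholders (my real training function is more involved), and the bucket path is redacted as in the logs above:

```python
from ray import tune

def trainable(config):
    # Placeholder objective; the real training function is not shown here.
    tune.report(loss=config["x"] ** 2)

analysis = tune.run(
    trainable,
    name="my_experiment",                 # hypothetical experiment name
    config={"x": tune.uniform(-1.0, 1.0)},
    num_samples=100,
    max_failures=3,                       # retry budget per trial before it counts as errored
    sync_config=tune.SyncConfig(
        upload_dir="gs://XXXXXXXXXXXXX",  # redacted bucket, same one the logs refer to
    ),
    resume="AUTO",                        # also tried "REMOTE", "ERRORED", "ERRORED_ONLY"
)
```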
Anything obvious that I am missing?