I cannot resume a broken tune run

I’ve been trying to resume a broken tune run with no success. I get the following message:

```
2023-09-08 23:27:10,732 INFO experiment_state.py:388 -- Trying to find and download experiment checkpoint at gs://XXXXXXXXXXXXX
2023-09-08 23:28:20,726 INFO experiment_state.py:424 -- A remote experiment checkpoint was found and will be used to restore the previous experiment state.
2023-09-08 23:28:20,730 WARNING trial_runner.py:418 -- Attempting to resume experiment from XXXXXXXX. This will ignore any new changes to the specification.
2023-09-08 23:28:20,730 INFO trial_runner.py:422 -- Using the newest experiment state file found within the experiment directory: experiment_state-2023-09-08_23-01-35.json
```

But then, eventually, I get:

File "/home/ray/.local/lib/python3.9/site-packages/ray/tune/tune.py", line 1130, in run
    ea = ExperimentAnalysis(
  File "/home/ray/.local/lib/python3.9/site-packages/ray/tune/analysis/experiment_analysis.py", line 113, in __init__
    assert self._checkpoints_and_paths
AssertionError

I use the functional API (tune.run) to run the experiment, and I have tried all the relevant values of the resume parameter: AUTO, REMOTE (I sync the experiment to GCP buckets), ERRORED, ERRORED_ONLY, etc.
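For reference, my launch/resume call looks roughly like this. This is a minimal sketch with a placeholder experiment name and objective, written against the legacy tune.run API in Ray 2.x; the sync parameter names have shifted across Ray versions:

```python
from ray import tune
from ray.air import session


def trainable(config):
    # Hypothetical objective; the real trainable is application-specific.
    session.report({"score": config["x"] ** 2})


analysis = tune.run(
    trainable,
    name="my_experiment",                  # must match the original run's name to resume it
    config={"x": tune.uniform(-1.0, 1.0)},
    num_samples=50,
    resume="AUTO",                         # also tried "REMOTE", "ERRORED", "ERRORED_ONLY"
    sync_config=tune.SyncConfig(
        upload_dir="gs://XXXXXXXXXXXXX",   # results are synced to a GCS bucket
    ),
)
```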

Also, I do not save checkpoints during the tuning process, so I would be fine with restarting all trials that errored (as long as they have not reached max_failures) or have not started yet.

Anything obvious that I am missing?

One other piece of info (I could not edit the original post): the AssertionError arises from the __init__ method of ExperimentAnalysis, but if I download the experiment checkpoint folder manually from GCP and then point ExperimentAnalysis at that local path, I do not get this error (and the _checkpoints_and_paths attribute is populated). Might the automatic download initiated by tune.run be putting my files on some wrong path?
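For what it's worth, this is the kind of manual check I mean (a sketch; the local directory and experiment name are placeholders for wherever gsutil puts the downloaded folder):

```python
from ray.tune import ExperimentAnalysis

# After downloading the experiment folder manually, e.g.:
#   gsutil -m cp -r gs://XXXXXXXXXXXXX/my_experiment /tmp/my_experiment
# pointing ExperimentAnalysis at the local copy works fine:
ea = ExperimentAnalysis("/tmp/my_experiment")
print(ea._checkpoints_and_paths)  # populated here, unlike inside tune.run's resume path
```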

Got it! It turns out some libraries were missing on the head node; in my case they were dm-tree, gymnasium, lz4, and tensorflow-probability.

Also, I had dynamically registered my trainable under a name that included the date on which the experiment began. Several days later, tune.run could no longer find that trainable because the name had changed. Now that the names are standardized, everything works.
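In other words, the registration went from a date-stamped name to a fixed one, roughly like this (sketch with hypothetical names):

```python
import datetime

from ray import tune


def trainable(config):
    ...  # same trainable function as in the run above


# Before: the registered name included the launch date, so a resumed run a
# few days later could not find the trainable it was originally launched with.
tune.register_trainable(
    f"my_trainable_{datetime.date.today():%Y-%m-%d}", trainable
)

# After: a stable name that stays valid across resumes.
tune.register_trainable("my_trainable", trainable)
```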