Load prior `tune.run()` results from disk

Based on Ray’s docs, I would expect the following to load the analysis object from a previous tune.run() process:

from ray.tune import ExperimentAnalysis
analysis = ExperimentAnalysis("~/ray_results/example-experiment")

I used the Ray Tune + PyTorch Lightning tutorial as a guide for working with my own model/dataset. When the process completed, I saw analysis.best_config printed to the console, so I believe everything ran correctly. However, if I run the above snippet to try to load the analysis back into memory, I get the following error message:

Could not load trials from experiment checkpoint. 
This means your experiment checkpoint is likely faulty or incomplete, 
and you won't have access to all analysis methods. 
Observed error: No module named ...

At first the missing module was gym, and after pip installing that, it’s now tree. My model doesn’t use either of these, so I’m confused about where these requirements come from. Before launching a large search space on a remote machine, I want to make sure that I can reliably retrieve the results of a prior tune.run() from disk in case the process crashes before printing analysis.best_config (in which case all the time spent on trials would’ve been wasted).

Hey @addisonklinke, thanks for reporting this. I took a look and found this workflow to be a little confusing/misleading as well. I created a GitHub issue to track this here: [Bug][Tune] Loading `ExperimentAnalysis` requires `Trainable` to be registered. · Issue #21212 · ray-project/ray · GitHub.

In short, the primary issue seems to be that the Trainable is not registered beforehand. As a result, the code tries to import some pre-defined RLlib Trainables, which then causes the error message you’ve observed.

To resolve this, you can add the following lines of code:

from ray.tune import register_trainable
register_trainable(training_function_name, training_function)

training_function_name should be the name of your training function. You can verify what name the code expects by searching for trainable_name in the ~/ray_results/example-experiment/experiment_state-<date>.json file.
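
If you’d rather do that check programmatically than grep by hand, a naive text search like the sketch below works (the JSON layout varies across Ray versions, so you may need to tweak it):

import glob
import os
import re

# Naive text search for the recorded trainable_name values, so you know
# exactly what string to pass to register_trainable()
pattern = os.path.expanduser(
    "~/ray_results/example-experiment/experiment_state-*.json"
)
for path in glob.glob(pattern):
    with open(path) as f:
        names = set(re.findall(r'"trainable_name":\s*"([^"]+)"', f.read()))
    print(path, names)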

@matthewdeng Thanks for the tip. Following the PyTorch Lightning tutorial, would this be the correct implementation of your suggestion?

register_trainable('train_mnist_tune', train_mnist_tune)

For my training function (and I suspect the MNIST example as well), this raises:

ValueError: Unknown argument found in the Trainable function. 
The function args must include a 'config' positional parameter. 
Any other args must be 'checkpoint_dir'

How, then, can I have a training function that takes other parameters, such as the number of GPUs or the path to data on disk? Do all of these belong in the config?

EDIT

By adding checkpoint_dir=None as the second parameter to train_mnist_tune(), I am able to avoid the ValueError above and include additional parameters. However, once tune.run() completes, I still get the original “could not load trials from experiment checkpoint” error when trying to initialize an ExperimentAnalysis object from the ~/ray_results folder.
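
For reference, here is a rough sketch of the signature that avoids the ValueError. The extra keyword arguments after checkpoint_dir are my own additions, tune.with_parameters is just one way to bind them, and the reported metric is a placeholder:

from ray import tune

# Sketch: 'config' comes first, 'checkpoint_dir' second, and any extra
# keyword arguments need defaults so Tune's signature check passes
def train_mnist_tune(config, checkpoint_dir=None, num_gpus=1, data_dir="~/data"):
    # ... build the LightningModule/Trainer from config and the extras ...
    tune.report(loss=0.0)  # placeholder metric for this sketch

# tune.with_parameters() binds the extra arguments, so tune.run() still
# sees a trainable taking (config, checkpoint_dir)
analysis = tune.run(
    tune.with_parameters(train_mnist_tune, num_gpus=1, data_dir="/mnt/mnist"),
    config={"lr": tune.loguniform(1e-4, 1e-1)},
)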

I noticed the population-based training section of the Lightning tutorial mentions adding checkpoints via TuneReportCheckpointCallback, so I tried this. Now there are checkpoint files in each trial folder, but it did not fix the analysis loading.

EDIT 2

I see now that register_trainable() must be called within the same process that initializes ExperimentAnalysis. Previously, I thought the registration was supposed to occur in the original tune.run() process, which is why it was not fixing the issue. With this modification, the proposed workaround is successful.
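
For anyone else who ends up here, the full loading script that works for me looks like this (the my_project import path is hypothetical, and I call get_best_config() explicitly since my training function reports a loss metric):

from ray.tune import ExperimentAnalysis, register_trainable

from my_project.train import train_mnist_tune  # hypothetical import path

# Register in THIS process, before constructing ExperimentAnalysis, so the
# trials in the experiment checkpoint can be restored without Tune trying
# to import the pre-defined RLlib trainables
register_trainable("train_mnist_tune", train_mnist_tune)

analysis = ExperimentAnalysis("~/ray_results/example-experiment")
print(analysis.get_best_config(metric="loss", mode="min"))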

Awesome! I’m glad you were able to find success here, and thanks for sharing the steps you took (if anyone else runs into this same issue I think they’ll really appreciate this).