Based on Ray’s docs, I would expect the following to load the analysis object from a previous tune.run() process:

from ray.tune import ExperimentAnalysis
analysis = ExperimentAnalysis("~/ray_results/example-experiment")
I used the Ray Tune + PyTorch Lightning tutorial as a guide for working with my own model/dataset. When the process completed, I saw analysis.best_config printed to the console, so I believe everything ran correctly. However, if I run the above snippet to try to load the analysis back into memory, I get the following error message:
Could not load trials from experiment checkpoint.
This means your experiment checkpoint is likely faulty or incomplete,
and you won't have access to all analysis methods.
Observed error: No module named ...
At first the missing module was gym, and after pip installing that, it’s now tree. My model doesn’t use either of these, so I’m confused about where these requirements are coming from. Before launching a large search space on a remote machine, I want to make sure that I can reliably retrieve the results of a prior tune.run() from disk in case the process crashes before printing analysis.best_config (in which case all the time spent on trials would’ve been wasted).
Hey @addisonklinke, thanks for reporting this. I took a look and found this workflow to be a little confusing/misleading as well. I created a Github issue to track this here: [Bug][Tune] Loading `ExperimentAnalysis` requires `Trainable` to be registered. · Issue #21212 · ray-project/ray · GitHub.
In short, the primary issue seems to be that the Trainable is not registered beforehand. As a result, the code tries to import some pre-defined RLlib Trainables, which then causes the error message you’ve observed.
To resolve this, you can add the following lines of code:
from ray.tune import register_trainable
register_trainable(training_function_name, training_function)
training_function_name should be the name of your training_function, but you can verify what the code expects this name to be by searching for trainable_name in the ~/ray_results/example-experiment/experiment_state-<date>.json file.
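If you’d rather not open the JSON by hand, a small helper can pull the name out of the checkpoint file. This is only a sketch, not part of Ray’s API: it assumes the experiment state JSON contains a "trainable_name" entry somewhere in its raw text, which may vary across Ray versions.

```python
import glob
import os


def find_trainable_name(experiment_dir):
    """Scan experiment_state-*.json files for the recorded trainable name.

    Sketch only: assumes the checkpoint JSON mentions "trainable_name";
    the exact layout differs between Ray versions.
    """
    pattern = os.path.join(
        os.path.expanduser(experiment_dir), "experiment_state-*.json"
    )
    for state_file in sorted(glob.glob(pattern)):
        with open(state_file) as f:
            text = f.read()
        # Crude but version-agnostic: search the raw JSON text, since the
        # trial data is sometimes nested inside serialized strings.
        idx = text.find('"trainable_name"')
        if idx != -1:
            after_key = text[idx:].split(":", 1)[1]
            return after_key.split('"')[1]
    return None
```

Calling find_trainable_name("~/ray_results/example-experiment") then returns the string to pass as the first argument of register_trainable, or None if no checkpoint file mentions it.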
@matthewdeng Thanks for the tip. Following the PyTorch Lightning tutorial, would this be the correct implementation of your suggestion?
register_trainable('train_mnist_tune', train_mnist_tune)
For my training function (and I suspect the MNIST example as well), this raises
ValueError: Unknown argument found in the Trainable function.
The function args must include a 'config' positional parameter.
Any other args must be 'checkpoint_dir'
How then can I have a training function which takes other parameters such as number of GPUs, path to data on disk, etc? Do all of these belong in the config?
EDIT

By adding checkpoint_dir=None as the second parameter to train_mnist_tune(), I am able to avoid the above ValueError and include additional parameters. However, once tune.run() completes, I still get the original “could not load trials from experiment checkpoint” error when trying to initialize an ExperimentAnalysis object from the ~/ray_results folder.
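Concretely, the signature fix plus functools.partial for the extra arguments looks something like the sketch below. The function body and all names here are placeholders for illustration, and whether Tune’s signature check accepts a partial may depend on the Ray version:

```python
from functools import partial


def train_mnist_tune(config, checkpoint_dir=None, data_dir=None, num_gpus=0):
    # Tune itself only supplies `config` (and `checkpoint_dir` when
    # restoring); any other arguments must be bound before the function
    # is handed to tune.run() or register_trainable().
    return {
        "lr": config["lr"],
        "data_dir": data_dir,
        "num_gpus": num_gpus,
    }


# Bind the extra parameters up front, leaving only the Tune-compatible
# (config, checkpoint_dir) signature exposed.
trainable = partial(train_mnist_tune, data_dir="/data/mnist", num_gpus=2)
```

An alternative is simply to put values like the data path and GPU count into config itself, since Tune passes the whole dict through to the function.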
I noticed the population based training section of the Lightning tutorial mentions adding checkpoints via TuneReportCheckpointCallback, so I tried this. Now there are checkpoint files in each trial folder, but it did not fix the analysis loading.
EDIT 2

I see now that register_trainable() must be called within the same process that initializes ExperimentAnalysis. Previously I thought the registration was supposed to occur in the original tune.run() process, and that’s why it was not fixing the issue. With this modification, the proposed workaround is successful.
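For anyone landing here later, the working sequence is sketched below: registration and loading happen in the same (new) process. The path and function name are carried over from the examples above; adjust them to your own experiment.

```python
from ray.tune import ExperimentAnalysis, register_trainable

# train_mnist_tune is your own training function; it must be imported or
# defined in *this* process before registration.
from my_project.train import train_mnist_tune  # hypothetical import

# Registration must happen in the same process that constructs
# ExperimentAnalysis, not (only) in the process that ran tune.run().
register_trainable("train_mnist_tune", train_mnist_tune)

analysis = ExperimentAnalysis("~/ray_results/example-experiment")
print(analysis.best_config)
```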
Awesome! I’m glad you were able to find success here, and thanks for sharing the steps you took (if anyone else runs into this same issue I think they’ll really appreciate this).