Loading experiment analysis from a different machine than the experiment was run with

luzgui · October 10, 2023, 3:29pm

Hello, I am training a PPO agent with RLlib on one machine running Windows, and then I copy the experiment folder to a different machine running Linux for testing purposes.

To get the analysis object, I do the following, hoping to be able to call get_best_trial() and get_best_checkpoint() and then build the PPO algorithm from the checkpoint:

from ray.tune import ExperimentAnalysis

analysis_object = ExperimentAnalysis(Linux_experiment_path,
                                     default_metric=metric,
                                     default_mode=mode)

However, the paths inside analysis_object always refer to the original Windows path, which produces errors.

What is the proper workflow in this case?
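For reference, the intended workflow described above can be sketched roughly as follows. This is only a sketch: it assumes a Ray version in which the copied folder's paths resolve correctly on the new machine, and the helper name load_best_algorithm and the metric name used in the usage note are illustrative, not taken from the original setup.

```python
from pathlib import Path


def load_best_algorithm(experiment_path, metric, mode):
    """Return (best_trial, algorithm) for an experiment folder on this machine.

    Hypothetical helper: the name and structure are illustrative only.
    """
    local_dir = Path(experiment_path).expanduser().resolve()
    if not local_dir.is_dir():
        raise FileNotFoundError(f"Experiment folder not found: {local_dir}")

    # Imported lazily so the path check above runs even without Ray installed.
    from ray.rllib.algorithms.algorithm import Algorithm
    from ray.tune import ExperimentAnalysis

    analysis = ExperimentAnalysis(
        str(local_dir), default_metric=metric, default_mode=mode
    )
    best_trial = analysis.get_best_trial()
    best_checkpoint = analysis.get_best_checkpoint(best_trial)
    if best_checkpoint is None:
        raise RuntimeError("No checkpoint found for the best trial.")

    # Assumption: from_checkpoint accepts the checkpoint object returned above.
    return best_trial, Algorithm.from_checkpoint(best_checkpoint)
```

A call might look like load_best_algorithm(Linux_experiment_path, "episode_reward_mean", "max"), where the metric and mode are example values.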

fardinabbasi · October 21, 2023, 1:16pm

Same issue here!
Please let me know if you find a solution.

justinvyu · October 23, 2023, 6:57pm

This is a known issue that’s being tracked here: [Train/Tune] Restore an experiment from a different machine/path · Issue #40585 · ray-project/ray · GitHub

Targeting a fix for Ray 2.9, but will keep this thread updated if a nightly is available earlier for you to use. Thanks for raising this issue.

luzgui · October 23, 2023, 9:20pm

Thank you very much @justinvyu
I am glad this is an identified issue.
I had been trying to solve this for days before moving on, because if I trained models on the remote machine, I could not load the checkpoints on my local machine to test, analyse, etc.

justinvyu · November 3, 2023, 1:07am

This should be fixed in the nightly version of Ray by this PR: [tune/train] Restore Tuner and results properly from moved storage path by justinvyu · Pull Request #40647 · ray-project/ray · GitHub.

https://docs.ray.io/en/latest/ray-overview/installation.html

Let me know if you get the chance to try it out!

luzgui · November 4, 2023, 10:44am

Thank you very much. As soon as I try it, I will report back here.
Best

lassefschmidt · December 5, 2023, 1:12pm

I face a similar issue. I created all my trials using Ray 2.6.1 and could analyse them without problems. Now I want to hand off the code to others, but no one can open my experiments. Even I cannot open the experiments anymore when I create a new virtual environment. (RuntimeError: Can’t return results as experiment has not been run, yet. Call Tuner.fit() to run the experiment first.)

Is there any workaround for this? I tried opening the experiments using Ray 2.6.1 and also the most recent 2.8.1 - same error each time.

Just re-running the experiments is not an option, so it would be quite painful to lose this data.

For reference, I am using this function to load a given experiment.

from ray import tune


def open_validate_ray_experiment(experiment_path, trainable):
    # Open and read the experiment folder without resuming unfinished trials
    print(f"Loading results from {experiment_path}...")
    restored_tuner = tune.Tuner.restore(
        experiment_path, trainable=trainable, resume_unfinished=False
    )
    result_grid = restored_tuner.get_results()
    print("Done!\n")

    # Check if there have been errors
    if result_grid.errors:
        print(f"At least one of the {len(result_grid)} trials failed!")
    else:
        print(f"No errors! Number of terminated trials: {len(result_grid)}")

    return restored_tuner, result_grid
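
One thing that can be ruled out cheaply before restoring is an incomplete copy of the experiment folder. The sketch below only inspects the file system and does not call Ray. It assumes that the top level of a Tune experiment folder contains a tuner.pkl file and at least one experiment_state-*.json file; that layout is an assumption that may differ between Ray versions, and the helper name missing_experiment_files is hypothetical.

```python
from pathlib import Path


def missing_experiment_files(experiment_path):
    """Return the expected top-level patterns that match nothing in the folder."""
    root = Path(experiment_path)
    # Assumed layout: these two patterns are not confirmed anywhere in this thread.
    expected = ("tuner.pkl", "experiment_state-*.json")
    return [pattern for pattern in expected if not any(root.glob(pattern))]
```

An empty list means both patterns were found. It does not prove that the restore will succeed, only that these files were not lost when the folder was copied.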

justinvyu · January 4, 2024, 11:47pm

The PR above should apply to Ray 2.9+. Let me know if you’re able to upgrade and try it out @lassefschmidt
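
To check whether an environment meets the 2.9+ requirement before trying to restore, something like the following could be used. It is a minimal sketch that only compares version numbers; the helper names are hypothetical. Note that a nightly build may report a development version string, so a passing check there does not by itself confirm that the build includes the PR.

```python
from importlib import metadata


def parse_release(version):
    """Turn a version string such as '2.9.0' into a (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def has_moved_path_fix(version):
    """True if the version is 2.9 or newer, per the statement in this thread."""
    return parse_release(version) >= (2, 9)


def installed_ray_version():
    """Return the installed Ray version string, or None if Ray is not installed."""
    try:
        return metadata.version("ray")
    except metadata.PackageNotFoundError:
        return None
```

For example, has_moved_path_fix("2.8.1") is False and has_moved_path_fix("2.9.0") is True, so the 2.6.1 and 2.8.1 environments mentioned above would both need an upgrade.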
