Since I had to figure out how to analyze my experiment results after running them on Google Cloud Platform (GCP) and syncing checkpoints to Google Cloud Storage (GCS), I figured other users might find my notes useful as well.
The linked Colab notebook contains everything needed to run these analyses on Colab.
If anyone has suggestions on how to improve the analysis of cloud experiments, please add them to this thread in the comments.
I agree that remote checkpoint analysis does not work great at the moment. Would it help if we supported something like analysis = ExperimentAnalysis("gs://bucket/experiment")?
In general, yes. But I also found that reading the files is a little tricky: doing it locally is impossible, as the amount of data is too large to transfer over the network.
With my approach, I now encounter the following when calling:
df = analysis.dataframe(metric="episode_reward_mean", mode="max")
Couldn't read config from 160 paths
I do not know yet whether this is because the files have to be pulled somehow before reading, or because the experiments lack some specific data needed for it; I guess it's the former.
Looking at the TrialCheckpoints received from get_best_checkpoint(), the cloud_path is None and the local_path points to the file that originally existed on the head node of my cluster. I think that when fusing the bucket to Colab we get a mangling of file paths, which makes this hard: even if the cloud_path existed and pointed to the GCS bucket, mounting that bucket would change the file paths on the Colab drive.
So, I guess having an ExperimentAnalysis that uses gsutil under the hood might be a solution.
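To illustrate the idea (this is not an existing Ray API): a thin helper could shell out to gsutil to rsync only the lightweight experiment state files, skipping the large checkpoint directories so the data volume stays small. The function name and the exclude pattern are assumptions about a typical Tune results layout:

```python
import subprocess

def fetch_experiment_state(bucket_uri, local_dir, dry_run=True):
    # Hypothetical helper: rsync the experiment folder from GCS, but
    # exclude bulky checkpoint_* directories so only the small state,
    # result, and config files are transferred. gsutil's `-x` flag
    # takes a regex of object paths to skip.
    cmd = [
        "gsutil", "-m", "rsync", "-r",
        "-x", r".*checkpoint_\d+.*",
        bucket_uri, local_dir,
    ]
    if not dry_run:
        # Requires gsutil to be installed and authenticated.
        subprocess.run(cmd, check=True)
    return cmd
```

With the state files local, a plain ExperimentAnalysis(local_dir) could then be constructed without pulling checkpoints over the net.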
@kai I guess, if we could additionally store a relative path (inside the local_dir) in the trial objects, and then pass ExperimentAnalysis a path to the new local directory, things should work out fine. I could work on the PR if you guide me a little.
I also updated my Colab notebook to include some notes about the actual complications and how to create a workaround by using the trials directly.
This sounds like a good idea, @Lars_Simon_Zehnder - alternatively there is ExperimentAnalysis._parse_cloud_path. We could probably rewrite this to generally support arbitrary experiment locations.
Generally it seems we don't support moving the experiment folder somewhere else, and this is definitely not great. I think in the first instance we should just repair this on initialization of experiment analysis - i.e. in load_trials_from_experiment_checkpoint, change trial_cp["logdir"] to the new location. You can check whether this solves your problem and then open a PR for it?
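The repair described above could look roughly like this; the function name and the shape of the serialized trial data are assumptions, with only trial_cp["logdir"] taken from the suggestion itself:

```python
import os

def rebase_trial_logdirs(trial_cps, new_experiment_dir):
    # Hypothetical sketch: on experiment-analysis initialization, rewrite
    # each serialized trial's absolute logdir so it points inside the
    # experiment folder's new location (e.g. a bucket mounted in Colab).
    for trial_cp in trial_cps:
        trial_name = os.path.basename(trial_cp["logdir"].rstrip("/"))
        trial_cp["logdir"] = os.path.join(new_experiment_dir, trial_name)
    return trial_cps
```

This only recovers the trial directory name, so it assumes trial folders keep their names when the experiment is moved.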
Hey @Lars_Simon_Zehnder, this is great! Looking forward to your contribution.
The main issue is that the logdir attribute of the Trial object is absolute. We would want a path relative to the experiment directory. To achieve this, we can modify the create_logdir function inside ray/tune/trial.py so that it returns both an absolute and a relative path, and save both of them as attributes. Then, inside experiment analysis, we would merge the experiment path with the relative paths stored in the trial objects.
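The path handling described above can be sketched as plain path arithmetic (this is not actual Ray code, just an illustration of the two steps):

```python
import os

def split_logdir(absolute_logdir, experiment_local_dir):
    # Derive the trial's logdir relative to the experiment directory,
    # as the proposed create_logdir change would store it alongside
    # the absolute path.
    return os.path.relpath(absolute_logdir, experiment_local_dir)

def resolve_logdir(relative_logdir, new_experiment_dir):
    # Inside experiment analysis: merge the (possibly moved) experiment
    # path with the stored relative path to locate the trial again.
    return os.path.join(new_experiment_dir, relative_logdir)
```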
Happy to also chat about this in a meeting! Let me know what works for you.