Guideline for analyzing experiment results on GCP
Since I had to figure out how to analyze my experiment results after running them on Google Cloud Platform (GCP) and syncing checkpoints to Google Cloud Storage (GCS), I figured other users might benefit from my findings as well.
In the linked Colab notebook you will find everything you need to run your analyses on Colab.
If anyone has suggestions on how to improve the analysis of cloud experiments, please add them in the comments of this thread.
@kai Are there any practices you can share?
I agree that remote checkpoint analysis does not work great at the moment. Would it help if we supported something like `analysis = ExperimentAnalysis("gs://bucket/experiment")`?
@kai thanks for coming back to this so quickly.
I would say in general yes. But I also found that reading the files is a little tricky. Doing it locally is impossible, as the amount of data is too large to transfer over the network.
With my approach I now encounter the following error when calling `df = analysis.dataframe(metric="episode_reward_mean", mode="max")`:

```
Couldn't read config from 160 paths
```
I do not know yet whether this is because the files have to be pulled somehow for reading, or whether the experiments lack some specific data needed for it; I guess it's the former.
Looking at the `TrialCheckpoints` received, the `cloud_path` is `None` and the `local_path` points to the file that originally existed on the head node of my cluster. I think that when fusing the bucket into Colab we get a mangling of file paths, and this makes things hard: even if the `cloud_path` existed and pointed to the GCS bucket, mounting this bucket would change the file paths on the Colab drive.
So I guess having an `ExperimentAnalysis` that uses `gsutil` under the hood might be a solution.
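To make the idea concrete, here is a minimal sketch of what such a loader could look like. This is my own illustration, not Ray Tune API: `parse_gs_uri`, `fetch_experiment`, and the default paths are assumptions, and it requires an installed, authenticated `gsutil`.

```python
# Hypothetical sketch: mirror the experiment state from GCS to a local
# directory first, then point ExperimentAnalysis at the local copy.
import os
import subprocess


def parse_gs_uri(uri):
    """Split 'gs://bucket/path/to/exp' into (bucket, 'path/to/exp')."""
    if not uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    return bucket, prefix


def fetch_experiment(gs_uri, local_root="/content/ray_results"):
    """Mirror the remote experiment directory locally with `gsutil rsync`."""
    _, prefix = parse_gs_uri(gs_uri)
    local_dir = os.path.join(local_root, os.path.basename(prefix.rstrip("/")))
    os.makedirs(local_dir, exist_ok=True)
    subprocess.run(["gsutil", "-m", "rsync", "-r", gs_uri, local_dir], check=True)
    return local_dir
```

One would then construct the analysis from the returned local directory; the stale trial paths inside the experiment checkpoint would still need fixing, which is the issue discussed below.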
I narrowed it down. The `Couldn't read config from 160 paths` error is due to the fact that when I call `analysis._get_trial_paths()`, what I get back are the `local_path`s of each trial.
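The core of the problem can be demonstrated with plain path arithmetic. A hedged sketch (`remap_trial_path` and all paths here are my own illustration, not anything from Tune): the stored paths point at the head node's filesystem, so they have to be re-anchored under wherever the experiment data now lives, e.g. a mounted bucket.

```python
import os


def remap_trial_path(stale_logdir, old_root, new_root):
    """Rewrite a logdir recorded on the cluster head node so that it
    points into the directory the experiment now lives in locally."""
    rel = os.path.relpath(stale_logdir, old_root)
    return os.path.join(new_root, rel)


stale = "/home/ray/ray_results/my_experiment/trial_0001"
print(remap_trial_path(stale, "/home/ray/ray_results", "/content/gcs/ray_results"))
# -> /content/gcs/ray_results/my_experiment/trial_0001
```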
@kai I guess if we could additionally store a relative path (relative to the `local_dir`) in the `Trial` objects, and then pass a path to the new local directory into the `ExperimentAnalysis`, things should work out fine. I could support the PR if you guide me a little.
I also updated my Colab notebook to include some notes about the actual complications and how to create a workaround by using the trials directly.
This sounds like a good idea, @Lars_Simon_Zehnder. Alternatively, there is `ExperimentAnalysis._parse_cloud_path`. We could probably rewrite this to support arbitrary experiment locations in general.
Generally, it seems we don't support moving the experiment folder somewhere else, and this is definitely not great. I think in the first instance we should just repair this on initialization of the experiment analysis, i.e. in `load_trials_from_experiment_checkpoint`, change the `trial_cp["logdir"]` to the new location. You can check whether this solves your problem and then open a PR for it?
Happy to help there, just tag me!
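A hedged sketch of what that repair might look like. `repair_logdir` is my own name, not the actual Tune internals, and it assumes the trial directory name itself did not change when the experiment folder moved.

```python
import os


def repair_logdir(trial_cp, new_experiment_dir):
    """On loading the experiment checkpoint, replace a trial's stale
    absolute logdir with one under the new experiment directory."""
    trial_dirname = os.path.basename(trial_cp["logdir"].rstrip("/"))
    repaired = dict(trial_cp)
    repaired["logdir"] = os.path.join(new_experiment_dir, trial_dirname)
    return repaired


cp = {"logdir": "/home/ray/ray_results/exp/trial_0001"}
print(repair_logdir(cp, "/content/gcs/exp")["logdir"])
# -> /content/gcs/exp/trial_0001
```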
Actually, let’s go with the relative paths. I think this is more maintainable and enables more use cases in the future.
Tagging @Yard1 who can either contribute this himself or help you with the contribution if you’re still up for it!
Thanks for the clarification, @kai. I would like to make a PR for this.
@Yard1 Would you be able to guide me a little? I could probably start on this next week.
Hey @Lars_Simon_Zehnder, this is great! Looking forward to your contribution.
The main issue is that the `logdir` attribute of the `Trial` object is an absolute path. We would want a path relative to the experiment directory. To achieve this, we can modify the `create_logdir` function inside `ray/tune/trial.py` so that it returns both an absolute and a relative path, and save both of them as attributes. Then, inside the experiment analysis, we would join the experiment path with the relative paths stored in the trial objects.
Happy to also chat about this in a meeting! Let me know what works for you.