ExperimentAnalysis on Google Cloud Storage

Guidelines for analyzing experiments on GCP

I had to figure out how to analyze my experiment results after running them on Google Cloud Platform (GCP) and syncing the checkpoints to Google Cloud Storage (GCS), so I guess other users might benefit from my findings as well.

In the linked Colab notebook you will find everything needed to run your analyses on Colab.

If anyone has suggestions on how to improve the analysis of cloud experiments, please add them to this thread in the comments.

@kai Are there some practices you can share?

Thanks Lars!

I agree that remote checkpoint analysis does not work great at the moment. Would it help if we supported something like analysis = ExperimentAnalysis("gs://bucket/experiment")?

@kai thanks for coming back to this so quickly.

In general I would say yes. But I also found that reading the files is a little tricky. Doing the analysis locally is not feasible, as the amount of data is too large to transfer over the network.

With my approach I now encounter the following:

df = analysis.dataframe(metric="episode_reward_mean", mode="max")
Couldn't read config from 160 paths

I do not know yet whether this happens because the files have to be pulled somehow before they can be read, or because the experiments lack some specific data needed for it. I guess it's the former.

Looking at the TrialCheckpoints returned by get_best_checkpoint, the cloud_path is None and the local_path points to the file that originally existed on the head node of my cluster. I think that mounting the bucket in Colab via FUSE mangles the file paths: even if the cloud_path existed and pointed to the GCS bucket, the file paths on the Colab drive would differ once the bucket is mounted.

So I guess an ExperimentAnalysis that uses gsutil under the hood might be a solution.

I narrowed it down. The Couldn't read config from 160 paths error is due to the fact that when I call analysis._get_trial_paths() I see the following:

['/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00004_fcnet_hiddens=64_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00010_fcnet_hiddens=64_16_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00002_fcnet_hiddens=16_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00009_fcnet_hiddens=32_32_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00014_fcnet_hiddens=128_32_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00007_fcnet_hiddens=16_16_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00005_fcnet_hiddens=128_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00013_fcnet_hiddens=128_16_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00003_fcnet_hiddens=32_2022-04-03_13-47-40',
 '/home/ray/ray_results/DQN/DQN_mini-grid_4d77d_00015_fcnet_hiddens=128_64_2022-04-03_13-47-40',

These are the local_paths of each trial.
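
To make the mismatch concrete, here is a minimal sketch of re-rooting these head-node paths onto a mounted bucket. It is not part of Ray: remap_trial_paths is a hypothetical helper, and /content/gcs/ray_results is an assumed mount point in Colab. It only works when the old root directory is known.

```python
from pathlib import PurePosixPath


def remap_trial_paths(trial_paths, old_root, new_root):
    """Re-root absolute trial paths from old_root onto new_root.

    Hypothetical helper for illustration, not a Ray API.
    """
    old_root = PurePosixPath(old_root)
    new_root = PurePosixPath(new_root)
    remapped = []
    for path in trial_paths:
        # relative_to raises ValueError if the path is not under old_root.
        relative = PurePosixPath(path).relative_to(old_root)
        remapped.append(str(new_root / relative))
    return remapped


head_node_paths = [
    "/home/ray/ray_results/DQN/"
    "DQN_mini-grid_4d77d_00004_fcnet_hiddens=64_2022-04-03_13-47-40",
]
# "/content/gcs/ray_results" is an assumed mount point, not from the thread.
colab_paths = remap_trial_paths(
    head_node_paths, "/home/ray/ray_results", "/content/gcs/ray_results"
)
```

The remapped strings would still have to be handed to the analysis somehow, which is what the discussion below is about.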

@kai I guess that if we also stored a relative path (inside the local_dir) in the trial objects and then passed the path of the new local directory to ExperimentAnalysis, things should work out fine. I could support the PR if you guide me a little.

I also updated my Colab notebook to include some notes about the actual complications and how to create a workaround by using the trials directly.

This sounds like a good idea, @Lars_Simon_Zehnder - alternatively there is ExperimentAnalysis._parse_cloud_path. We could probably rewrite this to generally support arbitrary experiment locations.

Generally it seems we don’t support moving the experiment folder somewhere else, and this is definitely not great. I think as a first step we should just repair this on initialization of the experiment analysis, i.e. in load_trials_from_experiment_checkpoint change the trial_cp["logdir"] to the new location. Could you try whether this solves your problem and then open a PR for it?
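
As a rough sketch of that repair step, assuming each trial checkpoint is a plain dict with a "logdir" key and that every trial directory sits directly inside the experiment directory: relocate_logdirs is a hypothetical standalone function, not the actual body of load_trials_from_experiment_checkpoint, and the "trial_id" field and the target directory are illustrative only.

```python
import copy
import posixpath


def relocate_logdirs(trial_checkpoints, new_experiment_dir):
    """Return copies of the checkpoint dicts with "logdir" re-rooted.

    Hypothetical helper; assumes trial directories are direct children
    of the experiment directory.
    """
    relocated = []
    for trial_cp in trial_checkpoints:
        patched = copy.deepcopy(trial_cp)
        # Keep only the trial directory name and drop the old absolute prefix.
        trial_dirname = posixpath.basename(posixpath.normpath(patched["logdir"]))
        patched["logdir"] = posixpath.join(new_experiment_dir, trial_dirname)
        relocated.append(patched)
    return relocated


checkpoints = [
    {
        "trial_id": "4d77d_00004",  # illustrative field
        "logdir": "/home/ray/ray_results/DQN/"
        "DQN_mini-grid_4d77d_00004_fcnet_hiddens=64_2022-04-03_13-47-40",
    }
]
# The new experiment directory is an assumed location.
patched = relocate_logdirs(checkpoints, "/content/gcs/ray_results/DQN")
```

The sketch copies the dicts instead of mutating them, so the original checkpoint data stays available for comparison.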

Happy to help there, just tag me!

By the way, this issue seems related: [Bug] ExperiementAnalysis doesnt work if path has been changed · Issue #21050 · ray-project/ray · GitHub

Actually, let’s go with the relative paths. I think this is more maintainable and enables more use cases in the future.

Tagging @Yard1 who can either contribute this himself or help you with the contribution if you’re still up for it!

Thanks for the clarification, @kai. I would like to make a PR for this.

@Yard1 Would you be able to guide me a little? I could probably start on this next week.

Hey @Lars_Simon_Zehnder, this is great! Looking forward to your contribution.

The main issue is that the logdir attribute of the Trial object is absolute. We would want to have a path relative to the experiment directory. In order to achieve it, we can modify the create_logdir function inside ray/tune/trial.py so that it returns both an absolute and relative path, and save both of them as attributes. Then, inside experiment analysis, we would merge the experiment path with the relative paths inside the trial objects.
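
A minimal sketch of that idea, under the assumption that the trial directory name is already known: TrialPaths, make_trial_paths and resolve_logdir are hypothetical names and do not exist in ray/tune/trial.py, and unlike a real create_logdir this sketch does not create any directory on disk.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class TrialPaths:
    logdir: str           # absolute path at training time
    relative_logdir: str  # path relative to the experiment directory


def make_trial_paths(experiment_dir, trial_dirname):
    """Compute both the absolute and the relative trial path (hypothetical)."""
    absolute = PurePosixPath(experiment_dir) / trial_dirname
    relative = absolute.relative_to(experiment_dir)
    return TrialPaths(logdir=str(absolute), relative_logdir=str(relative))


def resolve_logdir(new_experiment_dir, trial_paths):
    """Merge a possibly moved experiment directory with the stored relative path."""
    return str(PurePosixPath(new_experiment_dir) / trial_paths.relative_logdir)


paths = make_trial_paths(
    "/home/ray/ray_results/DQN",
    "DQN_mini-grid_4d77d_00002_fcnet_hiddens=16_2022-04-03_13-47-40",
)
# The new experiment directory is an assumed location.
moved_logdir = resolve_logdir("/content/gcs/ray_results/DQN", paths)
```

Because the relative part is stored at creation time, the analysis side no longer needs to know the old root directory.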

Happy to also chat about this in a meeting! Let me know what works for you.
