Saving checkpoints with good custom_metric using tune.run()

RickLan · May 9, 2021, 7:03am

Using tune.run(), I’m trying to save some checkpoints with “good” custom_metric values. The custom_metric value is computed via a custom callback only during evaluation. Reading the tune.run docs, it seems that I could keep the last keep_checkpoints_num best checkpoints using checkpoint_score_attr as score.

If that is true, how can I use my custom metric as score?

tune.run(
  keep_checkpoints_num=3,
  checkpoint_score_attr="evaluation/custom_metrics/score"
)

Btw, this score metric shows up in tensorboard as

ray/tune/evaluation/custom_metrics/score_mean
ray/tune/evaluation/custom_metrics/score_max
ray/tune/evaluation/custom_metrics/score_min

If lower score is better, do I write:

tune.run(
  keep_checkpoints_num=3,
  checkpoint_score_attr="min-evaluation/custom_metrics/score"
)

or

tune.run(
  keep_checkpoints_num=3,
  checkpoint_score_attr="evaluation/custom_metrics/score",
  mode='min',
)

?

My tune.run() looks like this:

tune.run(
  "A3C", 
  name="study",
  config=config, 
  stop=stop, 
)

stefanbschneider · May 10, 2021, 3:28pm

I believe what you need is

analysis = tune.run(
  "A3C", 
  name="study",
  config=config, 
  stop=stop, 
  # make checkpoints every 10 iterations and at the end, keep the best 3
  checkpoint_at_end=True, checkpoint_freq=10, keep_checkpoints_num=3,
  # select the "best" checkpoints according to the min custom metric (mean)
  checkpoint_score_attr='custom_metrics/score_mean', mode='min',
)
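
On the lower-is-better part of the question: as far as I recall from the Ray 1.x tune.run docs, checkpoint_score_attr ranks in increasing order by default and accepts a "min-" prefix to rank the other way, while mode is passed on to the search algorithm and scheduler. Please double-check this against the docs for your version. Below is a minimal plain-Python sketch of that convention, not Tune's actual code; parse_score_attr, lookup and keep_best are hypothetical helpers and the result dicts are made up.

```python
from typing import Any, Dict, List, Tuple


def parse_score_attr(attr: str) -> Tuple[str, bool]:
    """Split an optional 'min-' prefix off the attribute name.

    Returns (key, lower_is_better).
    """
    if attr.startswith("min-"):
        return attr[len("min-"):], True
    return attr, False


def lookup(result: Dict[str, Any], key: str) -> Any:
    """Follow an 'a/b/c' path through a nested result dict."""
    node: Any = result
    for part in key.split("/"):
        node = node[part]
    return node


def keep_best(results: List[Dict[str, Any]], attr: str, k: int) -> List[int]:
    """Return the training iterations of the k best-scoring results."""
    key, lower_is_better = parse_score_attr(attr)
    ranked = sorted(
        results, key=lambda r: lookup(r, key), reverse=not lower_is_better
    )
    return sorted(r["training_iteration"] for r in ranked[:k])


# Made-up results: one entry per checkpointed iteration.
results = [
    {"training_iteration": i, "evaluation": {"custom_metrics": {"score_mean": s}}}
    for i, s in [(2, 0.9), (4, 0.4), (6, 0.7), (8, 0.2), (10, 0.5)]
]

kept = keep_best(results, "min-evaluation/custom_metrics/score_mean", k=3)
print(kept)
```

With the prefix, the three lowest scores are kept; without it, the three highest.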

If you want to configure where these checkpoints are saved, you can pass the path via the local_dir argument to tune.run().

The returned analysis object allows you to analyze your training results afterwards (see ExperimentAnalysis)
For example, you could get a Pandas data frame as follows:

# set your metric of interest first as default; or specify it in all following function calls 
analysis.default_metric = 'custom_metrics/score_mean' 
analysis.default_mode = 'min'
# get the data frame
df = analysis.dataframe()
# or get the best checkpoint path
checkpoint_path = analysis.get_best_checkpoint(trial=analysis.get_best_trial())

Does this answer your question?

RickLan · May 11, 2021, 7:51pm

Hi @stefanbschneider Thank you for the detailed answer.

If the metric is evaluation/episode_reward_mean, then I get the following errors during runtime:

2021-05-12 04:40:41,093 ERROR checkpoint_manager.py:145 -- Result dict has no key: evaluation/episode_reward_mean. checkpoint_score_attr must be set to a key in the result dict.

Are metrics from evaluation not supported? Or am I doing something wrong? @sven1977

Below is my self-contained test script:

import ray
from ray import tune

from ray.rllib.examples.env.random_env import RandomEnv

config = {
  "env": RandomEnv,
  "lr": 1e-4,
  "env_config" : {
    "max_episode_len" : 5,
  },
  "evaluation_interval" : 2,
  "evaluation_num_episodes": 1,
}

stop = {
  "training_iteration": 12,
}

ray.init()


#metric = 'evaluation/episode_reward_mean' # this has errors
metric = 'episode_reward_mean'
mode = 'max'
results = tune.run(
  "A3C",
  name="test",
  config=config, 
  stop=stop, 
  checkpoint_at_end=False, checkpoint_freq=2, keep_checkpoints_num=3,
  checkpoint_score_attr=metric,
  mode=mode,
  )

results.default_metric = metric
results.default_mode = mode
# get the data frame
df = results.dataframe()
print(list(df.columns))
# or get the best checkpoint path
checkpoint_path = results.get_best_checkpoint(trial=results.get_best_trial())
print(checkpoint_path)

ray.shutdown()

Edit: for the non-evaluation case, I do get the 3 checkpoints due to keep_checkpoints_num=3

stefanbschneider · May 12, 2021, 8:39am

Doesn’t episode_reward_mean instead of evaluation/episode_reward_mean solve your problem?
I always used simply episode_reward_mean as metric in my scripts. Isn’t that the reward during evaluation? Or am I confusing something here?

RickLan · May 12, 2021, 11:12am

evaluation/episode_reward_mean doesn’t quite work. I get the runtime error from my previous reply. There is only one checkpoint folder, and it is the last one. Intermediate ones are created but then deleted. I expected there to be 3, as there are in the episode_reward_mean case.

In addition, I get the following errors:

2021-05-12 20:08:49,801 INFO tune.py:450 -- Total run time: 65.37 seconds (65.04 seconds for the tuning loop).
Traceback (most recent call last):
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'evaluation/episode_reward_mean'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./test_checkpoint_save.py", line 39, in <module>
    df = results.dataframe()
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py", line 107, in dataframe
    rows = self._retrieve_rows(metric=metric, mode=mode)
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py", line 280, in _retrieve_rows
    idx = df[metric].idxmax()
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'evaluation/episode_reward_mean'

stefanbschneider · May 12, 2021, 12:37pm

Hm, not sure about checkpoints during evaluation. What would that mean exactly?
Why not just use the existing checkpoints for episode_reward_mean?

Maybe @sven1977 can help out here?

RickLan · May 12, 2021, 12:57pm

Good questions. My environment is custom. I have randomness in training that I want to make deterministic in evaluation. In addition, I have additional analysis code that runs in the evaluation via the callback API. So in a perfect world, I want to use

checkpoint_score_attr='evaluation/custom_metrics/score_mean'

RickLan · May 12, 2021, 1:07pm

Or maybe you mean to suggest two stages? Capture checkpoints, then separately reload each checkpoint and evaluate it. That would be a workaround, but a tedious one.

mannyv · May 12, 2021, 3:31pm

Hi @RickLan,

I ran your example script above with ray 1.3.0 using metric = 'evaluation/episode_reward_mean'. I upped it to 16 iterations and tune.run finished fine for me and I had 3 checkpoints: “checkpoint_000006 checkpoint_000010 checkpoint_000016”.

The experiment_analysis part also failed for me and I think you uncovered a bug. There are two files in the trial directory that keep track of training results: progress.csv and result.json. The trial analysis uses the csv to create a dataframe but guess what, the evaluation results are not logged in the csv file for some reason. They are logged in the json file though.

RickLan · May 13, 2021, 12:01am

Hi @mannyv Thank you for helping. I have been running Ray 1.2. I repeated the test with 1.3 and was able to reproduce your results: 3 checkpoints saved, and the evaluation metric is in the json, but not the csv.

mannyv · May 13, 2021, 1:20am

Glad it worked. I figured out the issue with the csv logger. The very first time it logs data using the “on_result” method is when it creates the file and determines the “fieldnames” (flattened keys) that it will log for the duration of the experiment.

In an example like yours, where evaluation only runs every n > 1 iterations (evaluation_interval), the evaluation keys will not be in that first set of results, and so in subsequent calls to “on_result” they will be ignored. This will also be true for your custom_metrics keys if you only add them during evaluation.

I think it will be hard to change the behavior of the csv logger. Instead it might be easier to have the ExperimentAnalysis class build the dataframe from the json log file.
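
To illustrate the csv behavior described above, here is a small stdlib-only sketch. It is not Tune's logger code, just a csv.DictWriter whose fieldnames are fixed from the first result, with made-up result dicts; keys that first appear later are silently dropped.

```python
import csv
import io

# Made-up flattened results: the evaluation key only shows up in the 2nd one.
results = [
    {"training_iteration": 1, "episode_reward_mean": 1.0},
    {
        "training_iteration": 2,
        "episode_reward_mean": 1.5,
        "evaluation/episode_reward_mean": 2.0,
    },
]

buffer = io.StringIO()
writer = None
for result in results:
    if writer is None:
        # Column names are decided once, from the first result only.
        writer = csv.DictWriter(
            buffer, fieldnames=list(result.keys()), extrasaction="ignore"
        )
        writer.writeheader()
    # Keys that are not in fieldnames are ignored here.
    writer.writerow(result)

csv_text = buffer.getvalue()
header = csv_text.splitlines()[0].split(",")
print(header)
```

The header never gains an evaluation column, which matches the missing column seen in progress.csv.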

mannyv · May 13, 2021, 2:24am

At least for this example, using the json log does work with the following changes applied:

--- orig/ray/tune/analysis/experiment_analysis.py	2021-05-12 22:18:02.195126415 -0400
+++ fixed/ray/tune/analysis/experiment_analysis.py	2021-05-12 22:15:36.774961258 -0400
@@ -17,7 +17,7 @@
 
 from ray.tune.error import TuneError
 from ray.tune.result import DEFAULT_METRIC, EXPR_PROGRESS_FILE, \
-    EXPR_PARAM_FILE, CONFIG_PREFIX, TRAINING_ITERATION
+    EXPR_PARAM_FILE, CONFIG_PREFIX, TRAINING_ITERATION, EXPR_RESULT_FILE
 from ray.tune.trial import Trial
 from ray.tune.utils.trainable import TrainableUtil
 from ray.tune.utils.util import unflattened_lookup
@@ -172,9 +172,11 @@
         fail_count = 0
         for path in self._get_trial_paths():
             try:
-                self.trial_dataframes[path] = pd.read_csv(
-                    os.path.join(path, EXPR_PROGRESS_FILE))
-            except Exception:
+                data = [json.loads(line) for line in open(os.path.join(path, EXPR_RESULT_FILE), 'r').read().split('\n') if line]
+                self.trial_dataframes[path] = pd.json_normalize(data, sep="/")
+            except Exception as ex:
                 fail_count += 1
 
         if fail_count:
@@ -280,6 +282,8 @@
         assert mode is None or mode in ["max", "min"]
         rows = {}
         for path, df in self.trial_dataframes.items():
+            if metric not in df:
+                continue
             if mode == "max":
                 idx = df[metric].idxmax()
             elif mode == "min":
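
For anyone who does not want to patch Ray, the same idea can be sketched outside of ExperimentAnalysis with the stdlib only. This assumes result.json holds one JSON object per line, as the patch above implies; flatten and load_results are hypothetical helpers and the sample text is made up.

```python
import json
from typing import Any, Dict, List


def flatten(d: Dict[str, Any], parent: str = "", sep: str = "/") -> Dict[str, Any]:
    """Flatten nested dicts into 'a/b/c' style keys."""
    flat: Dict[str, Any] = {}
    for key, value in d.items():
        path = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


def load_results(text: str) -> List[Dict[str, Any]]:
    """Parse line-delimited JSON and flatten each result."""
    return [flatten(json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up stand-in for the contents of a trial's result.json.
sample = "\n".join([
    json.dumps({"training_iteration": 1, "episode_reward_mean": 1.0}),
    json.dumps({
        "training_iteration": 2,
        "episode_reward_mean": 1.5,
        "evaluation": {"episode_reward_mean": 2.0},
    }),
])

rows = load_results(sample)
metric = "evaluation/episode_reward_mean"
# Skip iterations without evaluation results, like the 'metric not in df' check.
with_metric = [r for r in rows if metric in r]
best = max(with_metric, key=lambda r: r[metric])
print(best["training_iteration"])
```

For a real trial you would pass the text of the result.json file to load_results instead of the sample string.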

RickLan · May 18, 2021, 12:14am

@mannyv Thank you! Let me try.

sven1977 · May 19, 2021, 3:47pm

Looks like a Ray tune bug, then, correct?
@kai @rliaw @amogkam

rliaw · May 19, 2021, 7:28pm

Yeah, this looks like a tune bug! @RickLan could you file this issue on Github?

RickLan · May 20, 2021, 7:40am

@rliaw To be fair, @stefanbschneider's code actually revealed the bug. I just ran it. I’ll try to file an issue on GitHub.

MaximeBouton · July 20, 2021, 12:02pm

Hi! Has this been resolved? I could not find the corresponding github issue to keep track of this. Could you share the link? Thanks!

David_Wilt · July 20, 2021, 4:35pm

Here is a code snippet from my callback example. Note that I only declare the root of the custom metric.

episode.custom_metrics["sharpe"] = sharpe(returns.values)
episode.custom_metrics["MDD"] = maximum_drawdown(pv)

Tune then creates three categories for each, e.g. …_mean/_min/_max, which I can successfully reference for stopping and other purposes. So my selected metric reads as follows:

custom_metrics/sharpe_mean

Hope this helps.
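
A rough sketch of how the _mean/_min/_max names come about from a single root key. This is plain Python for illustration, not RLlib's own aggregation code; summarize_custom_metrics is a hypothetical helper and the per-episode values are made up.

```python
from statistics import mean
from typing import Dict, List


def summarize_custom_metrics(episodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Collect per-episode values and report mean/min/max per metric name."""
    collected: Dict[str, List[float]] = {}
    for metrics in episodes:
        for name, value in metrics.items():
            collected.setdefault(name, []).append(value)

    summary: Dict[str, float] = {}
    for name, values in collected.items():
        summary[f"{name}_mean"] = mean(values)
        summary[f"{name}_min"] = min(values)
        summary[f"{name}_max"] = max(values)
    return summary


# Made-up values, as if each episode had set episode.custom_metrics["sharpe"].
episodes = [{"sharpe": 1.0}, {"sharpe": 2.0}, {"sharpe": 3.0}]
summary = summarize_custom_metrics(episodes)
print(sorted(summary))
```

So the key to reference is the root name plus the suffix, e.g. custom_metrics/sharpe_mean, never the bare root.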

mannyv · July 20, 2021, 10:30pm

@MaximeBouton it was fixed with this PR but you have to explicitly request the json in the config.

github.com/ray-project/ray: [tune] allow to read trial results from json files in Analysis
ray-project:master ← llan-ml:tune-analysis-json, opened by llan-ml on 19 May 2021 (04:30PM UTC), +56 -5. Related issue: closes #14390.
