Saving checkpoints with good custom_metric using

Using, I’m trying to save some checkpoints with “good” custom_metric values. The custom_metric value is computed via a custom callback only during evaluation. Reading the docs, it seems that I could keep the last keep_checkpoints_num best checkpoints using checkpoint_score_attr as score.

If that is true, how can I use my custom metric as score?

Btw, this score metric shows up in tensorboard as

  • ray/tune/evaluation/custom_metrics/score_mean
  • ray/tune/evaluation/custom_metrics/score_max
  • ray/tune/evaluation/custom_metrics/score_min
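For context, here is a minimal sketch of how one per-episode custom metric can end up as the three aggregated keys listed above. The helper name `summarize_custom_metric` and the sample values are made up for illustration; this is not RLlib's actual code, only the same idea in plain Python.

```python
from statistics import mean


def summarize_custom_metric(name, per_episode_values):
    """Aggregate per-episode values of one custom metric into three keys."""
    return {
        f"{name}_mean": mean(per_episode_values),
        f"{name}_min": min(per_episode_values),
        f"{name}_max": max(per_episode_values),
    }


# Hypothetical per-episode values of a custom metric called "score".
episode_scores = [1.0, 2.0, 3.0]
summary = summarize_custom_metric("score", episode_scores)
print(summary)
```

So a callback only sets the root name (here `score`), and the `_mean`, `_min`, and `_max` variants are what you reference as a metric afterwards.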

If lower score is better, do I write:



My looks like this:

I believe what you need is

analysis =
  # make checkpoints every 10 iterations and at the end, keep the best 10
  checkpoint_at_end=True, checkpoint_freq=10, keep_checkpoints_num=10,
  # select the "best" checkpoints according to the lowest custom metric (mean)
  checkpoint_score_attr='custom_metrics/score_mean', mode='min',

If you want to configure where these checkpoints are saved, you can pass the path via the local_dir argument to
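To make the "keep only the best N checkpoints by score" rule concrete, here is a small sketch in plain Python. It is not Tune's implementation; `parse_score_attr` and `keep_best` are hypothetical helpers. It assumes the convention that a `min-` prefix on the score attribute means "lower is better", which is how I remember the Ray 1.x `tune.run` docstring describing `checkpoint_score_attr`; please verify that against your Ray version.

```python
def parse_score_attr(attr):
    """Split an optional 'min-' prefix off a score attribute name."""
    if attr.startswith("min-"):
        return attr[len("min-"):], "min"
    return attr, "max"


def keep_best(checkpoints, attr, keep_num):
    """Return the keep_num best (iteration, score) pairs, best first."""
    _, mode = parse_score_attr(attr)
    ranked = sorted(checkpoints, key=lambda c: c[1], reverse=(mode == "max"))
    return ranked[:keep_num]


# Hypothetical (iteration, score) pairs; lower score is better here.
scored = [(10, 0.9), (20, 0.4), (30, 0.7), (40, 0.2)]
best = keep_best(scored, "min-custom_metrics/score_mean", keep_num=3)
print(best)
```

With these made-up scores, the checkpoint from iteration 10 is the one that would be dropped.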

The returned analysis object allows you to analyze your training results afterwards (see ExperimentAnalysis)
For example, you could get a Pandas data frame as follows:

# set your metric of interest first as default; or specify it in all following function calls 
analysis.default_metric = 'custom_metrics/score_mean' 
analysis.default_mode = 'min'
# get the data frame
df = analysis.dataframe()
# or get the best checkpoint path
checkpoint_path = analysis.get_best_checkpoint(trial=analysis.get_best_trial())

Does this answer your question?


Hi @stefanbschneider Thank you for the detailed answer.

If the metric is evaluation/episode_reward_mean, then I get the following errors during runtime:

2021-05-12 04:40:41,093 ERROR -- Result dict has no key: evaluation/episode_reward_mean. checkpoint_score_attr must be set to a key in the result dict.
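One way a key like this can be missing is sketched below: a slash-separated name is resolved against a nested result dict, and on iterations where no evaluation ran there is no `evaluation` sub-dict to descend into. The `lookup` and `has_key` helpers and the sample results are hypothetical, and I have not verified that this is what actually triggers the message above.

```python
def lookup(result, flat_key, sep="/"):
    """Walk a nested result dict along an 'a/b/c' style key."""
    node = result
    for part in flat_key.split(sep):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(flat_key)
        node = node[part]
    return node


def has_key(result, flat_key):
    """Return True if the slash-separated key resolves in the result dict."""
    try:
        lookup(result, flat_key)
        return True
    except KeyError:
        return False


# Hypothetical results with evaluation_interval=2:
# iteration 1 has no evaluation block, iteration 2 does.
result_iter1 = {"training_iteration": 1, "episode_reward_mean": 3.0}
result_iter2 = {
    "training_iteration": 2,
    "episode_reward_mean": 3.5,
    "evaluation": {"episode_reward_mean": 4.0},
}

print(has_key(result_iter1, "evaluation/episode_reward_mean"))
print(has_key(result_iter2, "evaluation/episode_reward_mean"))
```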

Are metrics from evaluation not supported? Or am I doing something wrong? @sven1977

Below is my self-contained test script:

import ray
from ray import tune

from ray.rllib.examples.env.random_env import RandomEnv

config = {
  "env": RandomEnv,
  "lr": 1e-4,
  "env_config" : {
    "max_episode_len" : 5,
  },
  "evaluation_interval" : 2,
  "evaluation_num_episodes": 1,
}

stop = {
  "training_iteration": 12,
}

#metric = 'evaluation/episode_reward_mean' # this has errors
metric = 'episode_reward_mean'
mode = 'max'
results =
  checkpoint_at_end=False, checkpoint_freq=2, keep_checkpoints_num=3,

results.default_metric = metric
results.default_mode = mode
# get the data frame
df = results.dataframe()
# or get the best checkpoint path
checkpoint_path = results.get_best_checkpoint(trial=results.get_best_trial())


Edit: for the non-evaluation case, I do get the 3 checkpoints due to keep_checkpoints_num=3

Doesn’t episode_reward_mean instead of evaluation/episode_reward_mean solve your problem?
I always used simply episode_reward_mean as metric in my scripts. Isn’t that the reward during evaluation? Or am I confusing something here?

evaluation/episode_reward_mean doesn’t quite work. I get the runtime error from my previous reply. There is only one checkpoint folder, and it is the last one. Intermediate ones are created but then deleted. I expect there to be 3, which is what I get in the episode_reward_mean case.

In addition, I get the following errors:

2021-05-12 20:08:49,801 INFO -- Total run time: 65.37 seconds (65.04 seconds for the tuning loop).
Traceback (most recent call last):
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/indexes/", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'evaluation/episode_reward_mean'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./", line 39, in <module>
    df = results.dataframe()
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/ray/tune/analysis/", line 107, in dataframe
    rows = self._retrieve_rows(metric=metric, mode=mode)
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/ray/tune/analysis/", line 280, in _retrieve_rows
    idx = df[metric].idxmax()
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/indexes/", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'evaluation/episode_reward_mean'

Hm, not sure about checkpoints during evaluation. What would that mean exactly?
Why not just use the existing checkpoints for episode_reward_mean?

Maybe @sven1977 can help out here?

Good questions. My environment is custom. I have randomness in training that I want to make deterministic in evaluation. In addition, I have additional analysis code that runs in the evaluation via the callback API. So in a perfect world, I want to use



Or maybe you mean to suggest two stages? Capture checkpoints then separately reload each checkpoint and evaluate. This is a workaround, but tedious :stuck_out_tongue:

Hi @RickLan,

I ran your example script above with ray 1.3.0 using metric = 'evaluation/episode_reward_mean'. I upped it to 16 iterations, and it finished fine for me with 3 checkpoints: “checkpoint_000006 checkpoint_000010 checkpoint_000016”.

The experiment_analysis part also failed for me, and I think you uncovered a bug. There are two files in the trial directory that keep track of training results: progress.csv and result.json. The trial analysis uses the csv to create a dataframe but, guess what, the evaluation results are not logged in the csv file for some reason. They are logged in the json file, though.


Hi @mannyv Thank you for helping. I have been running Ray 1.2. I repeated the test with 1.3 and was able to reproduce your results: 3 checkpoints saved, and the evaluation metric is in the json but not in the csv.


Glad it worked. I figured out the issue with the csv logger. The very first time it logs data using the “on_result” method is when it creates the file and determines the “fieldnames” (flattened keys) that it will log for the duration of the experiment.

In an example like yours, where evaluation only runs every n > 1 iterations, the evaluation keys will not be in that first set of results, so in subsequent calls to “on_result” they will be ignored. The same will be true for your custom_metrics keys if you only add them during evaluation.
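The effect can be reproduced with the standard library alone. This is a stand-in, not Tune's CSV logger: it assumes the logger behaves like a `csv.DictWriter` whose `fieldnames` are taken from the first result and whose extra keys are silently ignored afterwards. The sample results are made up.

```python
import csv
import io

# Hypothetical flattened results: the evaluation key first appears in iteration 2.
results = [
    {"training_iteration": 1, "episode_reward_mean": 3.0},
    {"training_iteration": 2, "episode_reward_mean": 3.5,
     "evaluation/episode_reward_mean": 4.0},
]

buffer = io.StringIO()
writer = None
for result in results:
    if writer is None:
        # Column names are fixed from the first result only.
        writer = csv.DictWriter(
            buffer, fieldnames=list(result.keys()), extrasaction="ignore"
        )
        writer.writeheader()
    # Keys that are not in fieldnames are dropped without any error.
    writer.writerow(result)

csv_text = buffer.getvalue()
print(csv_text)
```

The printed CSV has no evaluation column at all, which matches what shows up in progress.csv.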

I think it will be hard to change the behavior of the csv logger. Instead, it might be easier to have the ExperimentAnalysis class build the dataframe from the json log file.

At least for this example here using the json log does work with these changes applied.

--- orig/ray/tune/analysis/	2021-05-12 22:18:02.195126415 -0400
+++ fixed/ray/tune/analysis/	2021-05-12 22:15:36.774961258 -0400
@@ -17,7 +17,7 @@
 from ray.tune.error import TuneError
 from ray.tune.result import DEFAULT_METRIC, EXPR_PROGRESS_FILE, \
 from ray.tune.trial import Trial
 from ray.tune.utils.trainable import TrainableUtil
 from ray.tune.utils.util import unflattened_lookup
@@ -172,9 +172,11 @@
         fail_count = 0
         for path in self._get_trial_paths():
-                self.trial_dataframes[path] = pd.read_csv(
-                    os.path.join(path, EXPR_PROGRESS_FILE))
-            except Exception:
+                data = [json.loads(line) for line in open(os.path.join(path, EXPR_RESULT_FILE), 'r').read().split('\n') if line]
+                self.trial_dataframes[path] = pd.json_normalize(data, sep="/")
+            except Exception as ex:
                 fail_count += 1
         if fail_count:
@@ -280,6 +282,8 @@
         assert mode is None or mode in ["max", "min"]
         rows = {}
         for path, df in self.trial_dataframes.items():
+            if metric not in df:
+                continue
             if mode == "max":
                 idx = df[metric].idxmax()
             elif mode == "min":
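For anyone who wants to try the idea without patching Ray, here is a rough standard-library sketch of the same approach: read one JSON object per line, flatten nested keys with "/" as separator, and skip rows that lack the metric. The `flatten` helper is a hypothetical stand-in for `pd.json_normalize(data, sep="/")` from the patch, and the JSON lines are made up rather than read from a real result.json.

```python
import json


def flatten(nested, sep="/", prefix=""):
    """Flatten a nested dict into a single level with sep-joined keys."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep, full_key))
        else:
            flat[full_key] = value
    return flat


# Hypothetical content of a result.json file: one JSON object per line.
json_lines = "\n".join([
    json.dumps({"training_iteration": 1, "episode_reward_mean": 3.0}),
    json.dumps({"training_iteration": 2, "episode_reward_mean": 3.5,
                "evaluation": {"episode_reward_mean": 4.0}}),
])

rows = [flatten(json.loads(line)) for line in json_lines.split("\n") if line]

metric = "evaluation/episode_reward_mean"
# Only rows from iterations where evaluation ran carry the metric.
rows_with_metric = [row for row in rows if metric in row]
print(rows_with_metric)
```

Unlike the CSV case, the evaluation key survives here because every line is parsed on its own instead of being forced into the first line's columns.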

@mannyv Thank you! Let me try.


Looks like a Ray tune bug, then, correct?
@kai @rliaw @amogkam

Yeah, this looks like a tune bug! @RickLan could you file this issue on Github?

@rliaw To be fair, @stefanbschneider 's code actually revealed the bug. I just ran it :stuck_out_tongue: I’ll try to file an issue on GitHub.


Hi! Has this been resolved? I could not find the corresponding github issue to keep track of this. Could you share the link? Thanks!

Here is a code snippet from my callback example. Note that I only declare the root of the custom metric.

episode.custom_metrics["sharpe"] = sharpe(returns.values)
episode.custom_metrics["MDD"] = maximum_drawdown(pv)

Tune then creates three categories for each, e.g. …_mean/_min/_max, which I can successfully reference for stopping and other purposes. So my selected metric reads as follows:


Hope this helps.

@MaximeBouton it was fixed with this PR but you have to explicitly request the json in the config.
