Using tune.run(), I’m trying to save checkpoints with “good” custom_metrics values. The custom metric is computed via a custom callback, and only during evaluation. Reading the tune.run docs, it seems that I could keep the keep_checkpoints_num best checkpoints by using checkpoint_score_attr as the score.
If that is true, how can I use my custom metric as the score?
analysis = tune.run(
    "A3C",
    name="study",
    config=config,
    stop=stop,
    # make a checkpoint every 10 iterations and at the end; keep the best 10
    checkpoint_at_end=True, checkpoint_freq=10, keep_checkpoints_num=10,
    # rank the "best" checkpoints by the custom metric (its mean)
    checkpoint_score_attr='custom_metrics/score_mean', mode='min',
)
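For completeness, here is a rough sketch (not your actual callback, of course) of how such a custom metric could be fed in via RLlib's callback API; RLlib aggregates whatever you put into episode.custom_metrics into custom_metrics/&lt;name&gt;_mean, _min and _max in the result dict (class and metric names below are placeholders):
from ray.rllib.agents.callbacks import DefaultCallbacks

class ScoreCallback(DefaultCallbacks):
    # placeholder callback that records a custom per-episode score
    def on_episode_end(self, *, worker, base_env, policies, episode, env_index, **kwargs):
        # shows up in the results as custom_metrics/score_mean, _min and _max
        episode.custom_metrics["score"] = episode.total_reward  # placeholder score

# enable it via the trainer config:
# config["callbacks"] = ScoreCallback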
If you want to configure where these checkpoints are saved, you can pass the path via the local_dir argument to tune.run().
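For example (the path is just a placeholder):
# placeholder path; each trial's folder and checkpoints end up under <local_dir>/<name>/
analysis = tune.run("A3C", name="study", config=config, stop=stop,
                    local_dir="~/ray_results/my_study")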
The returned analysis object allows you to analyze your training results afterwards (see ExperimentAnalysis).
For example, you could get a Pandas data frame as follows:
# set your metric of interest first as default; or specify it in all following function calls
analysis.default_metric = 'custom_metrics/score_mean'
analysis.default_mode = 'min'
# get the data frame
df = analysis.dataframe()
# or get the best checkpoint path
checkpoint_path = analysis.get_best_checkpoint(trial=analysis.get_best_trial())
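Alternatively (assuming the Ray 1.3 ExperimentAnalysis API), you can pass metric and mode explicitly instead of setting the defaults:
# same result, but without relying on default_metric/default_mode
best_trial = analysis.get_best_trial(metric='custom_metrics/score_mean', mode='min')
checkpoint_path = analysis.get_best_checkpoint(best_trial,
                                               metric='custom_metrics/score_mean',
                                               mode='min')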
If the metric is evaluation/episode_reward_mean, then I get the following errors during runtime:
2021-05-12 04:40:41,093 ERROR checkpoint_manager.py:145 -- Result dict has no key: evaluation/episode_reward_mean. checkpoint_score_attr must be set to a key in the result dict.
Are metrics from evaluation not supported? Or am I doing something wrong? @sven1977
Below is my self-contained test script:
import ray
from ray import tune
from ray.rllib.examples.env.random_env import RandomEnv
config = {
    "env": RandomEnv,
    "lr": 1e-4,
    "env_config": {
        "max_episode_len": 5,
    },
    "evaluation_interval": 2,
    "evaluation_num_episodes": 1,
}
stop = {
    "training_iteration": 12,
}
ray.init()
#metric = 'evaluation/episode_reward_mean' # this has errors
metric = 'episode_reward_mean'
mode = 'max'
results = tune.run(
    "A3C",
    name="test",
    config=config,
    stop=stop,
    checkpoint_at_end=False, checkpoint_freq=2, keep_checkpoints_num=3,
    checkpoint_score_attr=metric,
    mode=mode,
)
results.default_metric = metric
results.default_mode = mode
# get the data frame
df = results.dataframe()
print(list(df.columns))
# or get the best checkpoint path
checkpoint_path = results.get_best_checkpoint(trial=results.get_best_trial())
print(checkpoint_path)
ray.shutdown()
Edit: for the non-evaluation case (metric = 'episode_reward_mean'), I do get the 3 checkpoints expected from keep_checkpoints_num=3.
Doesn’t using episode_reward_mean instead of evaluation/episode_reward_mean solve your problem?
I have always simply used episode_reward_mean as the metric in my scripts. Isn’t that the reward during evaluation? Or am I confusing something here?
evaluation/episode_reward_mean doesn’t quite work. I get the runtime error in my previous reply. There is only one checkpoint folder, and it is the last one. Intermediate ones are created but then deleted. I expect there to be 3, as there are in the episode_reward_mean case.
In addition, I get the following errors:
2021-05-12 20:08:49,801 INFO tune.py:450 -- Total run time: 65.37 seconds (65.04 seconds for the tuning loop).
Traceback (most recent call last):
File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'evaluation/episode_reward_mean'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./test_checkpoint_save.py", line 39, in <module>
df = results.dataframe()
File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py", line 107, in dataframe
rows = self._retrieve_rows(metric=metric, mode=mode)
File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py", line 280, in _retrieve_rows
idx = df[metric].idxmax()
File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "/Users/rick.lan/.pyenv/versions/cco-3.7.8/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'evaluation/episode_reward_mean'
Good questions. My environment is custom. I have randomness in training that I want to make deterministic in evaluation. I also have additional analysis code that runs during evaluation via the callback API. So in a perfect world, I want to use the metrics computed during evaluation.
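For reference, a rough sketch of the kind of setup I mean; the evaluation_config override with "explore": False is the standard RLlib way to make evaluation rollouts deterministic, and the rest mirrors the test script above:
from ray.rllib.examples.env.random_env import RandomEnv

config = {
    "env": RandomEnv,
    "env_config": {"max_episode_len": 5},
    "evaluation_interval": 2,
    "evaluation_num_episodes": 1,
    # evaluation-only overrides: run evaluation episodes without exploration noise
    "evaluation_config": {
        "explore": False,
    },
}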
I ran your example script above with ray 1.3.0 using metric = 'evaluation/episode_reward_mean'. I upped it to 16 iterations, tune.run finished fine for me, and I had 3 checkpoints: checkpoint_000006, checkpoint_000010, checkpoint_000016.
The experiment_analysis part also failed for me, and I think you uncovered a bug. There are two files in the trial directory that keep track of training results: progress.csv and result.json. The trial analysis uses the csv to build the dataframe, but the evaluation results are not logged in the csv file for some reason. They are logged in the json file, though.
Hi @mannyv, thank you for helping. I have been running Ray 1.2. I repeated the test with 1.3 and was able to reproduce your results: 3 checkpoints saved, and the evaluation metric is in the json but not the csv.
Glad it worked. I figured out the issue with the csv logger. The very first time it logs data via its “on_result” method, it creates the file and fixes the “fieldnames” (flattened keys) that it will log for the duration of the experiment.
In an example like yours, where evaluation only runs every n > 1 iterations, the evaluation keys are not in that first set of results, so they are ignored in all subsequent calls to “on_result”. The same will be true for your custom_metrics keys if you only add them during evaluation.
I think it will be hard to change the behavior of the csv logger. Instead, it might be easier to have the ExperimentAnalysis class build the dataframe from the json log file.
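In the meantime, a rough workaround along these lines (the trial directory path is a placeholder) should give you a dataframe that includes the evaluation keys, since result.json is newline-delimited JSON:
import json
import pandas as pd

trial_dir = "/path/to/your/trial_dir"  # placeholder: one trial's folder under local_dir/name

rows = []
with open(f"{trial_dir}/result.json") as f:
    for line in f:
        rows.append(json.loads(line))

# json_normalize flattens the nested results, so the evaluation metrics show up
# as e.g. "evaluation.episode_reward_mean" (note the "." instead of "/")
df = pd.json_normalize(rows)
print([c for c in df.columns if c.startswith("evaluation")])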
Tune then creates three aggregates for each custom metric, i.e. …_mean/_min/_max, which I can successfully reference for stopping and other purposes. So my selected metric reads as follows: