Custom metrics over evaluation only

I’m moving this Slack discussion over here, so I’ll try to keep it as clear and short as possible for future reference.

What I want: a metric computed over evaluation episodes, so that checkpoints can be saved and loaded based on that score.

Eg:

Train some agent to play Flappy Bird,

=> Evaluate while training, using a custom evaluation environment config with unseen maps (a rough sketch of such a config is below),

=> use those evaluation scores as the checkpoint metric.
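
For context, the evaluation part of the config would look roughly like this (just a sketch; the env name and the env_config key used to select the unseen maps are made up):

config = {
    "env": "flappy_bird",              # hypothetical registered env name
    "evaluation_interval": 5,          # evaluate every 5 training iterations
    "evaluation_num_episodes": 10,     # episodes per evaluation round
    "evaluation_config": {
        "explore": False,
        # made-up env_config override that switches to the unseen maps
        "env_config": {"maps": "unseen"},
    },
}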

In short: I can save custom metrics that I cannot use for checkpointing (although I can see them in TensorBoard), and I can write values to the result dictionary, which do load successfully as a checkpoint metric, but from there I have no access to episode statistics.

Details: I want to add a checkpoint_score_attr here, based on evaluation episode statistics:

results = tune.run("IMPALA",
                    verbose=1,
                    num_samples=1,
                    config=config,
                    stop=stop,
                    checkpoint_freq=25,
                    checkpoint_at_end=True,
                    sync_on_checkpoint=False,
                    keep_checkpoints_num=50,
                    checkpoint_score_attr='evaluation/episode_reward_mean',
                    # or using custom metrics from a custom callback class
                    #checkpoint_score_attr='custom_metrics/episode_reward_mean',
                    ...)

Using this:

https://github.com/ray-project/ray/blob/master/rllib/examples/custom_metrics_and_callbacks.py

I created a similar CustomCallback class, where I manually save statistics from the evaluation episodes (basically the sum of rewards per episode), to then use them as checkpoint_score_attr.
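
The callback is roughly along these lines (a condensed sketch; I use the "in_evaluation" flag from the worker's policy config to tell evaluation episodes apart from training ones, which may or may not be the intended way):

from ray.rllib.agents.callbacks import DefaultCallbacks

class CustomCallback(DefaultCallbacks):
    """Logs the return of every evaluation episode as a custom metric."""

    def on_episode_end(self, *, worker, base_env, policies, episode, env_index, **kwargs):
        # Only log episodes coming from the evaluation workers; RLlib then
        # aggregates "test_return" into custom_metrics/test_return_mean/_min/_max.
        if worker.policy_config.get("in_evaluation", False):
            episode.custom_metrics["test_return"] = episode.total_reward

config["callbacks"] = CustomCallback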

But custom_metrics/..., while saved in progress.csv, does not get saved to the result dictionary, causing an error because the attribute used does not exist:

[screenshots of the resulting error omitted]

But if instead I save directly to the result dict, as in the next image (with numeric values instead of the ‘I would love to…’ placeholder), it loads the metric attr without errors, since the value does get added to the result dict:
[screenshot omitted]
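
Concretely, adding something like this to the same callback class (a sketch; the hard-coded value is just a stand-in for the numbers in that screenshot):

class CustomCallback(DefaultCallbacks):
    ...
    def on_train_result(self, *, trainer, result, **kwargs):
        # A flat key written here does show up in the result dict and is
        # accepted as checkpoint_score_attr, but at this point I only have
        # the already-aggregated trainer results, not my per-episode
        # evaluation statistics.
        result["test_return_mean"] = 0.0  # placeholder numeric value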

The problem with this approach is that inside on_train_result I do not have access to my episode statistics, as I do in the previous method.


I have tried several approaches suggested in Slack, without any success, and now I’m a bit lost.


Thanks for asking this question. Could you create a small reproduction script (or scripts) so I can debug this more easily?

Hi Sven! I have been writing a notebook these days so the “issue” can be reproduced (and maybe later turn it into some kind of notebook tutorial on this).

I created a simple environment where the reward is proportional to the ID of the action taken, but in evaluation mode the proportionality is inverted:

Train: Bigger action ID => Bigger reward

Test (evaluation mode): Bigger action ID => More negative reward

So as the agent learns, it will perform better in training mode, but worse in evaluation. I want to save checkpoints based on these evaluation metrics.
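
Condensed, the environment in the notebook boils down to something like this (names and exact numbers differ a bit in the Colab; the "in_evaluation" flag would be set via env_config in the evaluation_config):

import gym
import numpy as np
from gym.spaces import Box, Discrete

class ActionRewardEnv(gym.Env):
    """Reward equals the action ID during training and minus the action ID
    when the env is flagged as being in evaluation mode."""

    def __init__(self, config=None):
        config = config or {}
        self.in_evaluation = config.get("in_evaluation", False)
        self.action_space = Discrete(5)
        self.observation_space = Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        reward = -float(action) if self.in_evaluation else float(action)
        done = self.steps >= 10
        return np.zeros(1, dtype=np.float32), reward, done, {}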

Simply “Run all” (Ctrl+F9) and search for “Here comes the issue”:

https://colab.research.google.com/drive/1IMCmbIeKswUHb_xO4Gf3hcwl4pHZE2jV?usp=sharing

No matter what checkpoint_score_attr I try, I could not get ‘test_return_mean’ inside custom_metrics to work.

What I want is to save checkpoints based on this “test_return_mean” metric (and later on, load the best of these checkpoints).

Let me know if anything is unclear.


Perfect, I’ll take a look.
