Custom metrics over evaluation only

I’m moving this Slack discussion over here, so I’ll try to keep it as clear and short as possible for future reference.

What I want: a metric computed over evaluation episodes, so that checkpoints can be saved and loaded based on that score.

Eg:

Train some agent to play Flappy Bird,

=> Evaluate while training, using a custom evaluation environment config with unseen maps (a rough sketch of such a config is below),

=> use those evaluation scores as the checkpoint metric.
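
For context, the evaluation part of the config would look roughly like this (just a sketch; the env name and the env_config key used to select the unseen maps are made up):

config = {
    "env": "flappy_bird",              # hypothetical registered env name
    "evaluation_interval": 5,          # evaluate every 5 training iterations
    "evaluation_num_episodes": 10,     # episodes per evaluation round
    "evaluation_config": {
        "explore": False,
        # made-up env_config override that switches to the unseen maps
        "env_config": {"maps": "unseen"},
    },
}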

In short: I can save custom metrics that I cannot use for checkpointing (although I can see them in TensorBoard), and I can write values to the result dictionary, which do load successfully as a checkpoint metric, but from there I have no access to episode statistics.

Details: I want to add a checkpoint_score_attr here, based on evaluation episode statistics:

results = tune.run("IMPALA",
                    verbose=1,
                    num_samples=1,
                    config=config,
                    stop=stop,
                    checkpoint_freq=25,
                    checkpoint_at_end=True,
                    sync_on_checkpoint=False,
                    keep_checkpoints_num=50,
                    checkpoint_score_attr='evaluation/episode_reward_mean',
                    # or using custom metrics from a custom callback class
                    #checkpoint_score_attr='custom_metrics/episode_reward_mean',
                    ...)

Using this:

https://github.com/ray-project/ray/blob/master/rllib/examples/custom_metrics_and_callbacks.py

I created a similar CustomCallback class, where I manually save statistics from the evaluation episodes (basically the sum of rewards per episode), to then use them as checkpoint_score_attr.
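
The callback is roughly along these lines (a condensed sketch; I use the "in_evaluation" flag from the worker's policy config to tell evaluation episodes apart from training ones, which may or may not be the intended way):

from ray.rllib.agents.callbacks import DefaultCallbacks

class CustomCallback(DefaultCallbacks):
    """Logs the return of every evaluation episode as a custom metric."""

    def on_episode_end(self, *, worker, base_env, policies, episode, env_index, **kwargs):
        # Only log episodes coming from the evaluation workers; RLlib then
        # aggregates "test_return" into custom_metrics/test_return_mean/_min/_max.
        if worker.policy_config.get("in_evaluation", False):
            episode.custom_metrics["test_return"] = episode.total_reward

config["callbacks"] = CustomCallback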

But custom_metrics/..., while saved in progress.csv, does not get saved to the result dictionary, causing an error because the attribute used does not exist:

[screenshots of the resulting error omitted]

But if instead I save directly to the result dict, as in the next image (with numeric values instead of the ‘I would love to…’ placeholder), it loads the metric attr without errors, since the value does get added to the result dict:
[screenshot omitted]
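
Concretely, adding something like this to the same callback class (a sketch; the hard-coded value is just a stand-in for the numbers in that screenshot):

class CustomCallback(DefaultCallbacks):
    ...
    def on_train_result(self, *, trainer, result, **kwargs):
        # A flat key written here does show up in the result dict and is
        # accepted as checkpoint_score_attr, but at this point I only have
        # the already-aggregated trainer results, not my per-episode
        # evaluation statistics.
        result["test_return_mean"] = 0.0  # placeholder numeric value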

The problem with this approach is that inside on_train_result I do not have access to my episode statistics, as I do in the previous method.


I have tried several approaches suggested in Slack, without any success, and now I’m a bit lost.


Thanks for asking this question. Could you create a small reproduction script (or scripts) so I can debug this more easily?

Hi Sven! I have been writing a notebook these days so the “issue” can be reproduced (and maybe later turn it into some kind of notebook tutorial on this).

I created a simple environment where the reward is proportional to the ID of the action taken, but in evaluation mode the proportionality is inverted:

Train: Bigger action ID => Bigger reward

Test (evaluation mode): Bigger action ID => More negative reward

So as the agent learns, it will perform better in training mode, but worse in evaluation. I want to save checkpoints based on these evaluation metrics.
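
Condensed, the environment in the notebook boils down to something like this (names and exact numbers differ a bit in the Colab; the "in_evaluation" flag would be set via env_config in the evaluation_config):

import gym
import numpy as np
from gym.spaces import Box, Discrete

class ActionRewardEnv(gym.Env):
    """Reward equals the action ID during training and minus the action ID
    when the env is flagged as being in evaluation mode."""

    def __init__(self, config=None):
        config = config or {}
        self.in_evaluation = config.get("in_evaluation", False)
        self.action_space = Discrete(5)
        self.observation_space = Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        reward = -float(action) if self.in_evaluation else float(action)
        done = self.steps >= 10
        return np.zeros(1, dtype=np.float32), reward, done, {}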

Simply “Run all” (Ctrl+F9) and search for “Here comes the issue”:

https://colab.research.google.com/drive/1IMCmbIeKswUHb_xO4Gf3hcwl4pHZE2jV?usp=sharing

No matter what checkpoint_score_attr I try, I could not get ‘test_return_mean’ inside custom_metrics to work.

What I want is to save checkpoints based on this “test_return_mean” metric (and later on, load the best of these checkpoints).

Let me know if anything is unclear.


Perfect, I’ll take a look.
