Recommended way to evaluate training results


I am trying to understand and recreate results from the major DQN/Rainbow papers using RLlib. My understanding of training and evaluation steps (with the standard dqn_nature pre-processing, where the action-repeat/frame-skip of 4 makes each agent time step correspond to 4 frames) is as follows:

  1. Train for 50M time steps (200M frames): with num_iterations=200 and training_steps=250k per iteration, the total time steps (single-agent steps) are 200 * 250k = 50M.
  2. Every 1M time steps of training, run evaluation for 125k time steps (500k frames), truncating evaluation episodes at 27000 time steps (108k frames).
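The step/frame arithmetic above can be sanity-checked with a quick sketch (assuming the 4x steps-to-frames factor comes from the action-repeat of 4):

```python
# Sanity check of the DQN/Rainbow schedule arithmetic.
# Assumption: frames = 4 * agent time steps (standard action-repeat of 4).
ACTION_REPEAT = 4

total_steps = 200 * 250_000                # num_iterations * training_steps
total_frames = total_steps * ACTION_REPEAT

eval_steps = 500_000 // ACTION_REPEAT      # 500k eval frames -> 125k agent steps
truncate_steps = 108_000 // ACTION_REPEAT  # 108k frames -> 27k agent steps

print(total_steps, total_frames, eval_steps, truncate_steps)
# -> 50000000 200000000 125000 27000
```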

My question is: what is the recommended way to recreate this training and evaluation setup in RLlib? From the docs, I see there are two ways to evaluate trained agents.

  1. Evaluating trained policies: uses checkpoints and `rllib rollout` to evaluate for a particular number of timesteps.
  2. Customized evaluation during training: evaluation runs alongside training, with the evaluation settings described in the code and in the docs.

While both appear to offer the same functionality, with the second option being much more flexible, is one of them a better/recommended way to approach the recreate-results problem? Will they produce exactly the same output?

If I want the following setting:

training time_steps = 50M
evaluation every 1M steps for 125k steps
truncate evaluation episodes at 27000 steps

From what I can see, the corresponding config keys for RLlib could then be as follows (with questions in comments):

config = {
    "timesteps_per_iteration": 25000,
    "evaluation_interval": 40,  # 25k * 40 = 1M (eval every 1M); default: None
    "evaluation_num_episodes": 10,  # Q: I want to specify 125k steps here, and truncate episodes at 27k steps
    "evaluation_parallel_to_training": True,  # default: False
    "in_evaluation": False,  # Q: not sure, should this be True?
    "evaluation_config": {
        "explore": False,
        "timesteps_per_iteration": 125000,  # 125k steps for eval
    },
    "evaluation_num_workers": 1,  # default: 0
    "custom_eval_function": None,
}

checkpoint_freq = 40  # 25000 timesteps_per_iteration x 40 iterations = 1M timesteps
stop = {"timesteps_total": 50000000}  # 50M; passed as tune.run(..., stop=stop, checkpoint_freq=checkpoint_freq)
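For what it's worth, the interval and iteration counts implied by these numbers can be checked with plain Python (no RLlib involved):

```python
# Derive the interval/checkpoint values from the paper's schedule, to check
# that the numbers in the config above are mutually consistent.
timesteps_per_iteration = 25_000
eval_every_steps = 1_000_000
total_training_steps = 50_000_000

evaluation_interval = eval_every_steps // timesteps_per_iteration  # iterations between evals
checkpoint_freq = evaluation_interval                              # checkpoint at the same cadence
total_iterations = total_training_steps // timesteps_per_iteration

print(evaluation_interval, checkpoint_freq, total_iterations)  # -> 40 40 2000
```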

The specific questions would then be:

Q: does the evaluation_config overwrite the config keys during evaluation? This is shown here in an example but was perhaps not very clear here in common parameters.
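My current understanding (a plain-Python sketch of the idea, not RLlib's actual merge code) is that evaluation workers see the base config with the evaluation_config entries layered on top:

```python
# Sketch of how evaluation_config overrides could be merged onto the base
# config. This mimics (does not call) RLlib's merge; nested dicts are
# shallow-merged here, so it is only an approximation of the real behavior.
base = {"explore": True, "env_config": {"frameskip": 4}, "num_workers": 8}
eval_overrides = {"explore": False}

eval_config = {**base, **eval_overrides}  # later keys win

print(eval_config["explore"])      # -> False (overridden for evaluation)
print(eval_config["num_workers"])  # -> 8 (inherited from the base config)
```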

Q: If I set the above config, how do I control the number of steps during evaluation? I can evaluate every 1M training steps, but it is not clear how to set 125k steps for each evaluation, or how to truncate an evaluation episode at 27k steps. The number of steps can be specified via --steps in rollout (method #1), but I am not entirely sure whether the above config using method #2 can achieve the same.
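To make the requirement concrete, here is the counting logic a `custom_eval_function` would need, as plain Python only (this is not the RLlib API; `episode_lengths` is a hypothetical stand-in for the lengths of episodes the eval workers would sample):

```python
# Step-budgeted evaluation: run whole episodes until a 125k-step budget is
# spent, truncating any single episode at 27k steps.
EVAL_BUDGET = 125_000
TRUNCATE_AT = 27_000

def evaluate(episode_lengths):
    steps_used = 0
    episodes = 0
    for length in episode_lengths:
        if steps_used >= EVAL_BUDGET:
            break
        # Truncate long episodes at 27k steps, and never overrun the budget.
        steps = min(length, TRUNCATE_AT, EVAL_BUDGET - steps_used)
        steps_used += steps
        episodes += 1
    return steps_used, episodes

print(evaluate([30_000, 5_000, 40_000, 100_000]))  # -> (86000, 4)
```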

Q: can evaluation_num_episodes help in this case? While it lets me run a certain number of episodes per evaluation, the number of episodes needed is not known beforehand (AFAIK), since episode lengths vary.
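One candidate knob for the 27k-step truncation is RLlib's common `horizon` parameter (hedged: please confirm your RLlib version supports it in this position), set only inside evaluation_config so that training episodes are unaffected:

```python
# Sketch: truncate evaluation episodes at 27k agent steps via `horizon`,
# assuming the RLlib version in use honors this key inside evaluation_config.
config = {
    # ... training settings as above ...
    "evaluation_config": {
        "explore": False,
        "horizon": 27_000,  # truncate eval episodes at 27k steps (108k frames)
    },
}
```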

Thank you for looking into this. I would appreciate any help.

P.S. I have gone through the docs and examples.


  • DQN Nature paper says “We trained for a total of 50 million frames” (is this per epoch? as the graphs show 200 training epochs)

  • Rainbow paper says “The average scores of the agent are evaluated during training, every 1M steps in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play)” The graphs are for 200 million frames.

  • Machado paper (revisiting the ALE) evaluates at 10M, 50M, 100M and 200M frames, with 5 trials per setting.

  • Following are the config keys for evaluation during training (the second method):

    # === Evaluation Settings ===
    # Evaluate with every `evaluation_interval` training iterations.
    # The evaluation stats will be reported under the "evaluation" metric key.
    # Note that evaluation is currently not parallelized, and that for Ape-X
    # metrics are already only reported for the lowest epsilon workers.
    "evaluation_interval": None,
    # Number of episodes to run per evaluation period. If using multiple
    # evaluation workers, we will run at least this many episodes total.
    "evaluation_num_episodes": 10,
    # Whether to run evaluation in parallel to a Trainer.train() call
    # using threading. Default=False.
    # E.g. evaluation_interval=2 -> For every other training iteration,
    # the Trainer.train() and Trainer.evaluate() calls run in parallel.
    # Note: This is experimental. Possible pitfalls could be race conditions
    # for weight synching at the beginning of the evaluation loop.
    "evaluation_parallel_to_training": False,
    # Internal flag that is set to True for evaluation workers.
    "in_evaluation": False,
    # Typical usage is to pass extra args to evaluation env creator
    # and to disable exploration by computing deterministic actions.
    # IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
    # policy, even if this is a stochastic one. Setting "explore=False" here
    # will result in the evaluation workers not using this optimal policy!
    "evaluation_config": {
        # Example: overriding env_config, exploration, etc:
        # "env_config": {...},
        # "explore": False
    },
    # Number of parallel workers to use for evaluation. Note that this is set
    # to zero by default, which means evaluation will be run in the trainer
    # process (only if evaluation_interval is not None). If you increase this,
    # it will increase the Ray resource usage of the trainer since evaluation
    # workers are created separately from rollout workers (used to sample data
    # for training).
    "evaluation_num_workers": 0,
    # Customize the evaluation method. This must be a function of signature
    # (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
    # Trainer.evaluate() method to see the default implementation. The
    # trainer guarantees all eval workers have the latest policy state before
    # this function is called.
    "custom_eval_function": None,
  • Following is the complete set of arguments for rollout:
rllib rollout --help
usage: rllib rollout [-h] --run RUN [--env ENV] [--local-mode] [--no-render] [--video-dir VIDEO_DIR] [--steps STEPS] [--episodes EPISODES] [--out OUT] [--config CONFIG] [--save-info] [--use-shelve]

Roll out a reinforcement learning agent given a checkpoint.

positional arguments:
  checkpoint            (Optional) checkpoint from which to roll out. If none given, will use an initial (untrained) Trainer.

optional arguments:
  -h, --help            show this help message and exit
  --local-mode          Run ray in local mode for easier debugging.
  --no-render           Suppress rendering of the environment.
  --video-dir VIDEO_DIR
                        Specifies the directory into which videos of all episode rollouts will be stored.
  --steps STEPS         Number of timesteps to roll out. Rollout will also stop if `--episodes` limit is reached first. A value of 0 means no limitation on the number of timesteps run.
  --episodes EPISODES   Number of complete episodes to roll out. Rollout will also stop if `--steps` (timesteps) limit is reached first. A value of 0 means no limitation on the number of episodes run.
  --out OUT             Output filename.
  --config CONFIG       Algorithm-specific configuration (e.g. env, hyperparams). Gets merged with loaded configuration from checkpoint file and `evaluation_config` settings therein.
  --save-info           Save the info field generated by the step() method, as well as the action, observations, rewards and done fields.
  --use-shelve          Save rollouts into a python shelf file (will save each episode as it is generated). An output filename must be set using --out.
  --track-progress      Write progress to a temporary file (updated after each episode). An output filename must be set using --out; the progress file will live in the same folder.

required named arguments:
  --run RUN             The algorithm or model to train. This may refer to the name of a built-in algorithm (e.g. RLlib's `DQN` or `PPO`), or a user-defined trainable function or class registered in the
                        tune registry.
  --env ENV             The environment specifier to use. This could be an openAI gym specifier (e.g. `CartPole-v0`) or a full class-path (e.g. `ray.rllib.examples.env.simple_corridor.SimpleCorridor`).

Example usage via RLlib CLI:
    rllib rollout /tmp/ray/checkpoint_dir/checkpoint-0 --run DQN
    --env CartPole-v0 --steps 1000000 --out rollouts.pkl

Example usage via executable:
    ./ /tmp/ray/checkpoint_dir/checkpoint-0 --run DQN
    --env CartPole-v0 --steps 1000000 --out rollouts.pkl

Example usage w/o checkpoint (for testing purposes):
    ./ --run PPO --env CartPole-v0 --episodes 500