Training mean reward vs. evaluation mean reward

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi, I'm struggling to get the same results when evaluating a trained model as I see during training: the evaluation mean reward is much lower.

I have a custom env with the following behavior:

  • Each reset initializes the env to one of 328 samples, advancing one sample at a time and wrapping around after the last one.
  • Each episode is around 100-120 timesteps and only returns done on the last timestep.

My training is set up like so:

SELECT_ENV = "my_env"
register_env(env_name, env_creator)

experiment =
        "env": SELECT_ENV,
            #"framework": "tf2",
            #"lambda": 0.95,
            #"kl_coeff": 0.5,
            #"clip_rewards": True,
            #"clip_param": 0.3,
            #"vf_clip_param": 10.0,
            #"vf_share_layers": True,
            #"vf_loss_coeff": 1e-2,
            #"entropy_coeff": 0.01,
            #"train_batch_size": 10000,
            #"sample_batch_size": 130,
            #"sgd_minibatch_size": 130,
            #"num_sgd_iter": 10,
            "num_workers": 6,
            #"num_envs_per_worker": 16,
            #"lr": 0.0001,
            "gamma": 1.0,
            "batch_mode": "complete_episodes",
            "metrics_smoothing_episodes": 328,
            #"num_cpus": 4
            #"model": {'_use_default_native_models': False, '_disable_preprocessor_api': False, '_disable_action_flattening': False, 'fcnet_hiddens': [512, 512], 'fcnet_activation': 'tanh', 'conv_filters': None, 'conv_activation': 'relu', 'post_fcnet_hiddens': [], 'post_fcnet_activation': 'relu', 'free_log_std': False, 'no_final_linear': False, 'vf_share_layers': False, 'use_lstm': False, 'max_seq_len': 20, 'lstm_cell_size': 256, 'lstm_use_prev_action': False, 'lstm_use_prev_reward': False, '_time_major': False, 'use_attention': False, 'attention_num_transformer_units': 1, 'attention_dim': 64, 'attention_num_heads': 1, 'attention_head_dim': 32, 'attention_memory_inference': 50, 'attention_memory_training': 50, 'attention_position_wise_mlp_dim': 32, 'attention_init_gru_gate_bias': 2.0, 'attention_use_n_prev_actions': 0, 'attention_use_n_prev_rewards': 0, 'framestack': True, 'dim': 84, 'grayscale': False, 'zero_mean': True, 'custom_model': None, 'custom_model_config': {}, 'custom_action_dist': None, 'custom_preprocessor': None, 'lstm_use_prev_action_reward': -1}
    stop={"training_iteration": 250},

And the test code, running on the SAME 328-sample dataset, looks like this:

register_env(env_name, env_creator)

config = ppo.PPOConfig()
agent =

env = env_creator(config)
state = env.reset()

sum_reward = 0

episodes = 1
while True:
    #action = agent.compute_single_action(state)
    action = agent.compute_action(state)
    state, reward, done, info = env.step(action)

    #if(reward != 0):
    #    print(reward)
    sum_reward += reward
    if done:
        # Stop after one full pass over the 328 samples.
        if episodes == 328:
            break
        state = env.reset()
        episodes += 1

print(sum_reward / episodes)

=> 12736.807102917062
=> 328
=> 38.83172897230812
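Since the training metric `episode_reward_mean` is an average of per-episode returns, it can help to collect the return of each evaluation episode separately instead of one running sum. The following is only a minimal sketch of such a loop: `CyclingEnv` and `constant_policy` are hypothetical stand-ins for the custom env and the restored agent, and the env is assumed to use the old Gym API (`reset()` returns an observation, `step()` returns a 4-tuple) as in the snippet above.

```python
from statistics import mean


class CyclingEnv:
    """Hypothetical stand-in: each reset() moves to the next sample and wraps around."""

    def __init__(self, num_samples=328, episode_len=100):
        self.num_samples = num_samples
        self.episode_len = episode_len
        self.sample_idx = -1  # the first reset() selects sample 0
        self.t = 0

    def reset(self):
        self.sample_idx = (self.sample_idx + 1) % self.num_samples
        self.t = 0
        return self.sample_idx  # toy observation: the sample index

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len  # done only on the last timestep
        reward = 1.0 if action == 1 else 0.0  # toy reward
        return self.sample_idx, reward, done, {}


def evaluate(env, policy, num_episodes):
    """Run num_episodes complete episodes and return the list of episode returns."""
    returns = []
    for _ in range(num_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            episode_return += reward
        returns.append(episode_return)
    return returns


def constant_policy(obs):
    # Hypothetical policy; in practice this would call the agent's action computation.
    return 1


returns = evaluate(CyclingEnv(), constant_policy, num_episodes=328)
print(len(returns), mean(returns))
```

Keeping the per-episode returns also makes it possible to look at their spread, not just the mean.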

The mean reward from evaluation is roughly 38, while TensorBoard and the training checkpoint show a much better mean reward of around 123.

check_point = experiment.get_trial_checkpoints_paths(trial=experiment.get_best_trial('episode_reward_mean'),

=> PPO_my_env_4cfa5_00000_0_2022-11-14_14-36-10\checkpoint_000250’, 123.2423709106124

Am I doing something wrong here? Thanks for any help, especially some hands-on changes :smiley:

Hi @SVH,

If you train with a stochastic policy, then you should expect the best performance when you also run inference and evaluation with a stochastic policy. You should keep explore=True.
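To illustrate the difference, here is a minimal sketch assuming a discrete action space. `select_action` and the probabilities are hypothetical and not part of RLlib; they only mimic what the `explore` flag changes: with explore=True the action is sampled from the policy's distribution, with explore=False the most likely action is always taken.

```python
import random


def select_action(probs, explore, rng):
    """Pick an action index from a categorical distribution.

    explore=True samples from the distribution; explore=False takes its mode.
    """
    if explore:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


rng = random.Random(0)
probs = [0.4, 0.35, 0.25]  # hypothetical action probabilities for one state

greedy = [select_action(probs, explore=False, rng=rng) for _ in range(1000)]
sampled = [select_action(probs, explore=True, rng=rng) for _ in range(1000)]

print(set(greedy))           # actions chosen without exploration
print(sorted(set(sampled)))  # actions chosen with exploration
```

During training the policy picked action 0 only part of the time, so always picking it at evaluation time can lead the agent into states it rarely saw while learning.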

I am not sure if you have any preprocessors but I think I remember @arturn saying that preprocessors are applied with compute_single_action but not compute_actions.

Policy methods don’t take preprocessing into account, algorithm methods (and RolloutWorkers) do!

Hi Arturn

After setting explore=True as suggested by mannyv, evaluation does produce better mean rewards, equivalent to those seen in training. Multiple re-runs do indeed yield stochastic results.

Question 1: Does this type of policy become more deterministic with more training, or is it in its nature to remain stochastic? In other words, does the variance shrink with more training? As of now the results vary greatly between multiple runs on the same sample, as do multiple re-runs over all samples.

Question 2:
One of our observation space parameters is highly volatile, like noise measurements that depend on wind and other things. Over a small subset of an episode it does not necessarily follow the same pattern as in the training set.
I thought applying a mean-std filter could help by normalizing the observation space. Does that seem like a sensible approach?
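For context, a mean-std filter keeps running estimates of the mean and standard deviation of each observation and rescales incoming values with them. In RLlib it is enabled with the config entry "observation_filter": "MeanStdFilter". The sketch below is not RLlib's implementation; it is a scalar illustration of the idea using Welford's online update, with a hypothetical class name and made-up readings.

```python
import math


class RunningMeanStd:
    """Scalar running mean/std using Welford's online update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; reported as 0.0 until two values were seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def normalize(self, x, eps=1e-8):
        return (x - self.mean) / (self.std + eps)


stat = RunningMeanStd()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical noisy readings
    stat.push(reading)

print(stat.mean, stat.std, stat.normalize(9.0))
```

Because the statistics are updated as data arrives, the same raw reading can map to slightly different normalized values early in training.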

But enabling the filter gets me this Python exception:

ray.exceptions.RayTaskError(UFuncTypeError): ray::RolloutWorker.sample() (pid=28396, ip=, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x000002C8E2A6D610>)
  File "python\ray\_raylet.pyx", line 662, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 666, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 613, in ray._raylet.execute_task.function_executor
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\_private\", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\util\tracing\", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\evaluation\", line 806, in sample
    batches = []
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\evaluation\", line 92, in next
    batches = [self.get_data()]
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\evaluation\", line 282, in get_data
    item = next(self._env_runner)
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\evaluation\", line 684, in _env_runner
    active_envs, to_eval, outputs = _process_observations(
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\evaluation\", line 936, in _process_observations
    filtered_obs: EnvObsType = _get_or_raise(worker.filters, policy_id)(
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\utils\", line 291, in __call__
    return _helper(x,, self.buffer, self.shape)
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\utils\", line 276, in _helper
  File "c:\users\SVH\anaconda3\lib\site-packages\ray\rllib\utils\", line 110, in push
    self._M[...] += delta / self._n
numpy.core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'add' output from dtype('O') to dtype('float64') with casting rule 'same_kind'

Is this a bug? (If so, I can report it on GitHub.) I can find many people who seem to use this filter with PPO.

I'm running Ray 2.0 on Windows with Python 3.8.5.

The error that occurred when using the MeanStdFilter was caused by the observation being a list (accepted by tf but not tf2). After converting it to a NumPy array, the filter works just fine.
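The cast error from the traceback can be reproduced outside of RLlib. The sketch below assumes the observation ended up as a NumPy array with dtype=object (one way a non-array observation can be stored), and mimics the in-place running-mean update from the traceback; the variable names are hypothetical.

```python
import numpy as np

running_mean = np.zeros(3, dtype=np.float64)

# Assumption: the env returned something that NumPy stores with dtype=object.
obs_object = np.array([1.0, 2.0, 3.0], dtype=object)
try:
    # Same shape of update as `self._M[...] += delta / self._n` in the traceback.
    running_mean[...] += obs_object / 1
    failed = False
except TypeError as exc:  # NumPy's ufunc casting errors subclass TypeError
    failed = True
    print(type(exc).__name__)

# Returning a float array from the env avoids the cast error.
obs_float = np.asarray([1.0, 2.0, 3.0], dtype=np.float32)
running_mean[...] += obs_float / 1
print(running_mean)
```

In the env itself, this corresponds to returning something like np.asarray(obs, dtype=np.float32) from reset() and step(), with a dtype that matches the observation space.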