Cannot reproduce training results in evaluation even on same dataset

  • High: It blocks me from completing my task.

I have a model that seems to train very well: the mean reward converges well above 100, even allowing for variance. When I serve the model or evaluate it on the SAME dataset, it only produces a mean reward below 40. I'm struggling to figure out how to get the trained model to perform as well as the training results suggest.

About the model:

  1. It's a custom env that on each reset randomly resets the state to one of 328 samples
  2. The env only returns done when there are no more observations in the sample, which is after around 100-140 steps
  3. There can be small rewards before episode termination, but most positive / negative rewards come at the end of the sample
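
To make the setup concrete, here is a minimal sketch of an env with the behaviour described above. The class name `SampleReplayEnv`, the placeholder observations, and the reward values are hypothetical; only the reset and termination logic follows the description, and it uses the old gym-style API (`reset()` returns the observation, `step()` returns a 4-tuple).

```python
import random


class SampleReplayEnv:
    """Toy stand-in for the custom env (hypothetical, old gym-style API)."""

    def __init__(self, num_samples=328, seed=0):
        self._rng = random.Random(seed)
        # Each sample is a list of placeholder observations, 100-140 steps long.
        self._samples = [
            [float(t) for t in range(self._rng.randint(100, 140))]
            for _ in range(num_samples)
        ]
        self._sample = None
        self._t = 0

    def reset(self):
        # Pick one of the samples at random on every reset.
        self._sample = self._rng.choice(self._samples)
        self._t = 0
        return self._sample[self._t]

    def step(self, action):
        self._t += 1
        # The episode ends only when the sample has no more observations.
        done = self._t >= len(self._sample)
        # Placeholder rewards: nothing during the episode, the bulk at the end.
        reward = 100.0 if done else 0.0
        # On the terminal step the last observation is repeated.
        obs = self._sample[min(self._t, len(self._sample) - 1)]
        return obs, reward, done, {}
```

Because `reset()` draws a sample at random, running 328 episodes does not guarantee that every sample is visited exactly once.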


from ray import tune
from ray.tune.registry import register_env

env_name = "my_env"
register_env(env_name, env_creator)

# The call itself was cut off in the post; tune.run with PPO is assumed here.
experiment = tune.run(
    "PPO",
    config={
        "env": env_name,
        # "framework": "tf2",
        # "lambda": 0.95,
        # "kl_coeff": 0.5,
        # "clip_rewards": True,
        # "clip_param": 0.3,
        # "vf_clip_param": 10.0,
        # "vf_share_layers": True,
        # "vf_loss_coeff": 1e-2,
        # "entropy_coeff": 0.01,
        # "train_batch_size": 10000,
        # "rollout_fragment_length": 140,
        # "sample_batch_size": 130,
        # "sgd_minibatch_size": 130,
        # "num_sgd_iter": 10,
        "num_workers": 6,
        # "num_envs_per_worker": 16,
        # "lr": 0.0001,
        "gamma": 1.0,
        "batch_mode": "complete_episodes",
        "metrics_smoothing_episodes": 300,
        # "num_cpus": 4,
    },
    stop={"training_iteration": 250},
)


register_env(env_name, env_creator)

config = ppo.PPOConfig()
agent =

env = env_creator(config)
state = env.reset()

sum_reward = 0

episodes = 1
while True:
    #action = agent.compute_single_action(state)
    action = agent.compute_action(state)
    state, reward, done, info = env.step(action)

    #if(reward != 0):
    #    print(reward)
    sum_reward += reward
    if done:
        if episodes == 328:
            break  # stop after 328 episodes (assumed intent; the loop had no exit)
        state = env.reset()
        episodes += 1

print(sum_reward / episodes)
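
One way to make the evaluation easier to compare with the training metric is to track the return of each episode separately, since RLlib's `episode_reward_mean` is an average over per-episode returns. The sketch below assumes the old gym-style API used above; `evaluate`, `policy_fn` and `_ToyEnv` are hypothetical names, and `policy_fn` stands in for something like `agent.compute_single_action`.

```python
import statistics


def evaluate(env, policy_fn, num_episodes):
    """Run num_episodes full episodes and return the list of episode returns."""
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            action = policy_fn(state)
            state, reward, done, info = env.step(action)
            episode_return += reward
        returns.append(episode_return)
    return returns


class _ToyEnv:
    """Hypothetical 3-step env, used only to show the helper running."""

    def reset(self):
        self._t = 0
        return 0

    def step(self, action):
        self._t += 1
        done = self._t >= 3
        # The reward simply echoes the action so the result is easy to check.
        return self._t, float(action), done, {}


returns = evaluate(_ToyEnv(), policy_fn=lambda state: 1, num_episodes=5)
print(statistics.mean(returns), statistics.pstdev(returns))
```

With the real agent this could be called as `evaluate(env, agent.compute_single_action, 328)`, which also gives the spread of returns rather than only the mean.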

Mean reward across episodes is closer to 40 than to 100+.

I'm struggling to figure out why I cannot reproduce the training results even on the same dataset used for training. I did try increasing the batch size to 10,000 so that around 100 of the 328 samples fully play out in each training batch.
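
As a rough check of that estimate, assuming episodes of 100 to 140 steps as described above and a batch of 10,000 steps made up of complete episodes:

```python
train_batch_size = 10_000
min_len, max_len = 100, 140  # episode lengths from the env description

# Fewest and most complete episodes that fit into one train batch.
fewest = train_batch_size // max_len
most = train_batch_size // min_len
print(fewest, most)  # 71 100
```

So one batch covers roughly 71 to 100 episodes, and because resets are random, some of the 328 samples may repeat within a batch while others are missing.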


If possible, does anyone have something hands-on I can try?

@SVH This has been answered by @mannyv in the reply to a similar problem of yours, I guess.

The explore attribute for evaluation has to be set to True to achieve comparable results to training.
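
To illustrate why the `explore` setting matters: during training PPO samples actions from the policy's action distribution, whereas with `explore=False` RLlib takes the distribution's most likely action. The toy sketch below uses plain Python without RLlib, and the probabilities and rewards are made up; it only shows how the two modes can give different mean rewards for the same policy.

```python
import random

# Hypothetical policy output for one state: probabilities over 3 actions.
action_probs = [0.4, 0.35, 0.25]
# Hypothetical reward per action; the most likely action is not the best one.
rewards = [0.0, 100.0, 100.0]


def act(probs, explore, rng):
    if explore:
        # Stochastic: sample from the distribution, as during training.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Deterministic: always pick the most likely action.
    return max(range(len(probs)), key=lambda a: probs[a])


rng = random.Random(0)
n = 10_000
mean_explore = sum(rewards[act(action_probs, True, rng)] for _ in range(n)) / n
mean_greedy = sum(rewards[act(action_probs, False, rng)] for _ in range(n)) / n
print(mean_explore, mean_greedy)
```

In RLlib itself the flag can be passed per call, e.g. `agent.compute_single_action(state, explore=True)`, or set through the evaluation config.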