- High: It blocks me from completing my task.
I have a model that seems to train very well: the mean reward converges well above 100, even accounting for variance. But when I serve or evaluate the model on the SAME dataset it was trained on, it only produces a mean reward below 40. I'm struggling to figure out how to get the trained model to reproduce the performance suggested by training.
About the model:
- It's a custom env whose reset() randomly resets the state to one of 328 samples
- The env only returns done when there are no more observations left in the sample, which takes around 100-140 steps
- There can be small rewards before episode termination, but most of the positive/negative reward arrives at the end of the sample
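In case the structure matters, here is a stripped-down sketch of what the env does (all names here are hypothetical, my real env is more involved):

```python
import random

class SampleReplayEnv:
    """Sketch of the env described above (names made up).

    reset() picks one of 328 pre-recorded samples at random; step() walks
    through that sample's observations and returns done=True only once the
    sample is exhausted (roughly 100-140 steps), with most of the reward
    arriving on the final step.
    """

    def __init__(self, samples):
        self.samples = samples  # e.g. 328 sequences of observations
        self.sample = None
        self.t = 0

    def reset(self):
        self.sample = random.choice(self.samples)
        self.t = 0
        return self.sample[self.t]

    def step(self, action):
        self.t += 1
        done = self.t >= len(self.sample) - 1
        # small shaping reward mid-episode, bulk of the reward at termination
        reward = 10.0 if done else 0.1
        obs = self.sample[self.t]
        return obs, reward, done, {}
```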
Training
env_name = "my_env"
register_env(env_name, env_creator)

experiment = tune.run(
    "PPO",
    config={
        "env": env_name,
        # "framework": "tf2",
        # "eager_tracing": True,
        # "lambda": 0.95,
        # "kl_coeff": 0.5,
        # "clip_rewards": True,
        # "clip_param": 0.3,
        # "vf_clip_param": 10.0,
        # "vf_share_layers": True,
        # "vf_loss_coeff": 1e-2,
        # "entropy_coeff": 0.01,
        # "train_batch_size": 10000,
        # "rollout_fragment_length": 140,
        # "sample_batch_size": 130,
        # "sgd_minibatch_size": 130,
        # "num_sgd_iter": 10,
        "num_workers": 6,
        # "num_envs_per_worker": 16,
        # "lr": 0.0001,
        "gamma": 1.0,
        "batch_mode": "complete_episodes",
        "metrics_smoothing_episodes": 300,
        # "num_cpus": 4
    },
    metric="episode_reward_mean",
    mode="max",
    stop={"training_iteration": 250},
    checkpoint_at_end=True,
)
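(In case it's relevant: checkpoint_path used below comes from the tune results, roughly via the ExperimentAnalysis API; exact calls may differ across Ray versions.)

```python
# Fragment, depends on the tune.run() result above; not standalone runnable.
best_trial = experiment.get_best_trial("episode_reward_mean", mode="max")
checkpoint_path = experiment.get_best_checkpoint(
    best_trial, metric="episode_reward_mean", mode="max"
)
```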
Evaluation
register_env(env_name, env_creator)
config = ppo.PPOConfig()
config.explore = False
agent = config.build(env=env_name)
agent.restore(checkpoint_path)

env = env_creator(config)
state = env.reset()
sum_reward = 0
episodes = 1
while True:
    # action = agent.compute_single_action(state)
    action = agent.compute_action(state)
    state, reward, done, info = env.step(action)
    # if reward != 0:
    #     print(reward)
    sum_reward += reward
    if done:
        if episodes == 328:
            break
        else:
            state = env.reset()
            episodes += 1

print(sum_reward)
print(episodes)
print(sum_reward / episodes)
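For reference, the same measurement can be expressed per episode, which makes it directly comparable to episode_reward_mean (a generic duck-typed sketch; evaluate is a made-up helper, not an RLlib API):

```python
def evaluate(agent, env, n_episodes):
    """Roll out n_episodes and return (mean_return, per_episode_returns).

    Works with any agent exposing compute_single_action(obs) and any env
    with the classic reset()/step() API (hypothetical helper, not RLlib code).
    """
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        done, ep_return = False, 0.0
        while not done:
            action = agent.compute_single_action(obs)
            obs, reward, done, info = env.step(action)
            ep_return += reward
        returns.append(ep_return)
    return sum(returns) / len(returns), returns
```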
The mean reward across episodes is closer to 40 than to 100+.
I'm struggling to figure out why I cannot reproduce the training results, even on the same dataset used for training. I did try increasing train_batch_size to 10,000 so that around 100 of the 328 samples fully play out per training iteration.
If possible, does anyone have something hands-on I can try?