When training with RLlib, episode_reward_max is always 0.

I'm trying to train a heads-up no-limit Texas hold'em agent with RLlib's PPO, using the texas_holdem_no_limit environment from PettingZoo. During training, episode_reward_max stays at 0 while episode_reward_min stays at -1. My understanding is that at the end of each hand, the change in a player's chip count is used as the reward, so the maximum of this value should sometimes be greater than 0. Is this behavior abnormal, or is my understanding wrong?
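
A quick way to sanity-check that understanding, independent of RLlib, is to play one hand of the raw PettingZoo env with random legal actions and sum each player's rewards. This is a minimal sketch assuming a recent PettingZoo release in which the observation dict carries an "action_mask" entry:

from pettingzoo.classic import texas_holdem_no_limit

env = texas_holdem_no_limit.env()
env.reset(seed=42)

# Sum the reward each player receives over one hand.
totals = {agent: 0.0 for agent in env.possible_agents}
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    totals[agent] += reward
    if termination or truncation:
        action = None
    else:
        # Sample only from the legal actions in this state.
        action = env.action_space(agent).sample(observation["action_mask"])
    env.step(action)
env.close()

print(totals)  # the winner's total should be > 0, the loser's < 0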

Here is part of the training code:

import os

import ray
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env
from pettingzoo.classic import texas_holdem_no_limit

ray.init(num_gpus=8)

env_name = "poker"

# Wrap the PettingZoo AEC env so RLlib can treat it as a multi-agent env.
register_env(env_name, lambda _: PettingZooEnv(
    texas_holdem_no_limit.env()
))

# CNNModelV2 is a custom torch model defined in the part of the script not shown here.
ModelCatalog.register_custom_model("BaselineModel", CNNModelV2)

config = (
    PPOConfig()
    .environment(env=env_name, clip_actions=True, disable_env_checking=True)
    .rollouts(num_rollout_workers=4, rollout_fragment_length=128)
    .resources(num_gpus=8)
    .framework(framework="torch")
    .debugging(log_level="ERROR")
    # Stay on the old API stack so the custom ModelV2 model is used.
    .rl_module(_enable_rl_module_api=False)
    .training(
        _enable_learner_api=False,
        train_batch_size=512,
        lr=1e-4,
        gamma=0.99,
        lambda_=0.9,
        use_gae=True,
        clip_param=0.4,
        grad_clip=None,
        entropy_coeff=0.1,
        vf_loss_coeff=0.25,
        sgd_minibatch_size=64,
        num_sgd_iter=10,
        model={"custom_model": "BaselineModel"},
    )
)

tune.Tuner(
    "PPO",
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            checkpoint_frequency=10,
        ),
        stop={"timesteps_total": 10000000 if not os.environ.get("CI") else 50000},
    ),
    param_space=config,
).fit()
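
To also see the per-agent returns that RLlib records internally (rather than only the aggregated episode_reward_min/max), a callback along the following lines could be attached to the config. This is only a sketch assuming the old API stack that the config above keeps enabled; RewardLogger is an illustrative name, not part of RLlib:

from ray.rllib.algorithms.callbacks import DefaultCallbacks

class RewardLogger(DefaultCallbacks):
    def on_episode_end(self, *, worker, base_env, policies, episode, env_index, **kwargs):
        # episode.agent_rewards maps (agent_id, policy_id) -> summed reward for this episode.
        for (agent_id, policy_id), ret in episode.agent_rewards.items():
            episode.custom_metrics[f"{agent_id}_return"] = ret

The callback would be registered with config = config.callbacks(RewardLogger) before building the Tuner; the per-agent values then show up under custom_metrics in the training results.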

Training results: