I’m trying to train a heads-up no-limit Texas hold’em agent using RLlib’s PPO, with PettingZoo’s texas_holdem_no_limit as the environment. During training, episode_reward_max stays at 0 while episode_reward_min stays at -1. As I understand it, the reward at the end of each round should be the change in the player’s chip count, so the maximum should be greater than 0. Is this behavior abnormal, or is my understanding incorrect?
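A minimal sketch of one possible explanation, not a confirmed diagnosis. It assumes (a) the episode_reward_* metrics are the sum of both players' rewards in an episode, and (b) an episode that ends on an illegal action gives -1 to the offending player and 0 to the other. Both assumptions should be checked against the per-policy reward metrics and the environment's documentation. The reward values below are made up for illustration and are not taken from a real run.

```python
# Illustrative only: hypothetical per-agent rewards, not output from training.

def episode_total(agent_rewards):
    """Sum the per-agent rewards of one episode into a single number."""
    return sum(agent_rewards.values())

# Assumed zero-sum outcome: the winner gains what the loser loses.
showdown = {"player_0": 3.0, "player_1": -3.0}

# Assumed illegal-action outcome: the offender gets -1, the opponent gets 0.
illegal = {"player_0": -1.0, "player_1": 0.0}

totals = [episode_total(showdown), episode_total(illegal)]
print(max(totals), min(totals))  # prints "0.0 -1.0"
```

Under these assumptions the summed reward can never exceed 0, even though each player's own reward can be positive, so the per-agent or per-policy rewards would be the more informative numbers to look at.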
Here is part of the training code:
ray.init(num_gpus=8)
env_name = "poker"
register_env(env_name, lambda _: PettingZooEnv(
    texas_holdem_no_limit.env()
))
ModelCatalog.register_custom_model("BaselineModel", CNNModelV2)
config = (
    PPOConfig()
    .environment(env=env_name, clip_actions=True, disable_env_checking=True)
    .rollouts(num_rollout_workers=4, rollout_fragment_length=128)
    .resources(num_gpus=8)
    .framework(framework="torch")
    .debugging(log_level="ERROR")
    .rl_module(_enable_rl_module_api=False)
    .training(
        _enable_learner_api=False,
        train_batch_size=512,
        lr=1e-4,
        gamma=0.99,
        lambda_=0.9,
        use_gae=True,
        clip_param=0.4,
        grad_clip=None,
        entropy_coeff=0.1,
        vf_loss_coeff=0.25,
        sgd_minibatch_size=64,
        num_sgd_iter=10,
        model={"custom_model": "BaselineModel"},
    )
)
tune.Tuner(
    "PPO",
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            checkpoint_frequency=10,
        ),
        stop={"timesteps_total": 10000000 if not os.environ.get("CI") else 50000},
    ),
    param_space=config,
).fit()
Training results: