Tips for tuning in a competitive multi-agent turn-based environment

Hi there,

I’m looking for some guidance on tuning my PPO agent for a card game environment. To check whether training is going well, I run a custom evaluation that plays a few hundred games versus a simple heuristic policy I wrote to imitate a novice player.

With the default PPO settings, the evaluation win rate climbs to around 50% over the first ~400 training iterations, which isn’t bad, but after that progress definitely plateaus and the following 1000 iterations only yield a slight further increase. I’m hoping that with some parameter tuning I’ll be able to achieve better results.
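
For concreteness, the evaluation that produces these win-rate numbers is along these lines. This is a heavily simplified sketch: `CardGameEnv`, the `"learner"`/`"opponent"` agent ids, `heuristic_action()` and `env.winner()` are placeholder names rather than the real ones from my project, and I’m assuming the old-style `Trainer` API.

```python
# Simplified sketch of the "few hundred games vs the heuristic" evaluation.
# CardGameEnv, the "learner"/"opponent" agent ids, heuristic_action() and
# env.winner() are placeholders, not real names from the project.

def evaluate_vs_heuristic(trainer, num_games=300):
    wins = 0
    for _ in range(num_games):
        env = CardGameEnv()
        obs = env.reset()
        done = {"__all__": False}
        while not done["__all__"]:
            actions = {}
            # Turn-based multi-agent env: obs only contains the agent(s)
            # whose turn it currently is.
            for agent_id, agent_obs in obs.items():
                if agent_id == "learner":
                    actions[agent_id] = trainer.compute_single_action(
                        agent_obs, policy_id="learner")
                else:
                    actions[agent_id] = heuristic_action(agent_obs)
            obs, rewards, done, _ = env.step(actions)
        wins += int(env.winner() == "learner")  # assumed helper on the env
    return wins / num_games

# Usage with a trained checkpoint (old Trainer API):
#   trainer = PPOTrainer(config=my_config)
#   trainer.restore("/path/to/checkpoint")
#   print(evaluate_vs_heuristic(trainer))
```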

A few questions on using tune in this setting:

  1. What metric should I try to optimize? (A rough sketch of the setup I have in mind follows this list.)
    • The one that really counts is how well the agent does against my heuristic agent
    • I would want to maximise the win rate against the heuristic agent, or, better, maximise the average score of the opponent agent (average losing score * losing rate)
  2. When should a trial end? The best ones so far converge to around a 0.55 win rate in about 1M steps.
    • A larger batch size seemed to work better; at the very least it meant far fewer training iterations
  3. Is it possible to get out of a plateau?
    • At the moment, all my agents (trained purely with self-play) tend to converge to around a 0.5-0.6 win rate vs my heuristic agent.
    • Perhaps I could use a checkpoint of an agent that has already reached a 50% win rate and tune from there to try and gain a further increase?
  4. How often would I have to run the evaluation script?
    • When tuning, I think the trainer is required to return the metric being optimized on every training iteration, which seems to imply that an evaluation has to be carried out every iteration.
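
For points 1, 2 and 4, what I had in mind is something like the following. This is only a rough sketch: I’m assuming the classic `tune.run` / `DefaultCallbacks` API (newer RLlib versions pass `algorithm` instead of `trainer` to the callback), `evaluate_vs_heuristic` is the evaluation loop sketched above, and the env name, stop values and `num_samples` are placeholders.

```python
from ray import tune
from ray.rllib.agents.callbacks import DefaultCallbacks


class HeuristicEvalCallback(DefaultCallbacks):
    def on_train_result(self, *, trainer, result, **kwargs):
        # Anything added to `result` here is reported to Tune for this
        # iteration, so it can be used as the metric to optimize and stop on.
        # Running a full evaluation every iteration is expensive, which is
        # what question 4 is about.
        result["heuristic_win_rate"] = evaluate_vs_heuristic(
            trainer, num_games=200)


analysis = tune.run(
    "PPO",
    config={
        "env": "card_game",              # placeholder registered env name
        "callbacks": HeuristicEvalCallback,
        # ... PPO hyperparameter search space goes here ...
    },
    metric="heuristic_win_rate",         # question 1: metric Tune optimizes
    mode="max",
    stop={                               # question 2: when a trial should end
        "timesteps_total": 2_000_000,
        "heuristic_win_rate": 0.9,
    },
    num_samples=8,
)
```

For question 3, as far as I can tell `tune.run` also accepts a `restore="/path/to/checkpoint"` argument, but it seems to be intended for a single trial, so warm-starting every trial from the 50%-win-rate checkpoint may need a different workaround.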

Do you have any recommendations for ranges for the hyperparameters? For context, my environment has about 300 input states and about 500 discrete actions. Once the agent has become relatively good, an episode usually takes about 40 steps (before it gets its reward; there is no dense reward). Each player also takes two steps per turn: discard a card, then choose one to pick up. I’ve modelled these as separate steps. I could have concatenated the pick-up action onto the discard action, but that basically multiplies the action space by 3 and didn’t seem to give any additional benefit when I tried it.
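
In case it helps frame that question, the kind of search space I was planning to sweep looks roughly like this. These are just common starting ranges, nothing validated on my game, and they would be merged into the `config` passed to `tune.run` in the sketch above.

```python
from ray import tune

# Rough starting ranges only; nothing here has been validated on this game.
ppo_search_space = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "train_batch_size": tune.choice([4000, 16000, 64000]),
    "sgd_minibatch_size": tune.choice([128, 512, 2048]),
    "num_sgd_iter": tune.choice([5, 10, 20]),
    "clip_param": tune.uniform(0.1, 0.3),
    "entropy_coeff": tune.loguniform(1e-4, 1e-2),
    # Sparse end-of-episode reward over ~40 steps, so keep gamma close to 1.
    "gamma": tune.choice([0.99, 0.997, 1.0]),
    "lambda": tune.uniform(0.9, 1.0),
    "model": {"fcnet_hiddens": tune.choice([[256, 256], [512, 512]])},
}
```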

Cheers,

Rory


@sven1977 please take a look

  1. The win rate vs a fixed agent should converge to 1.0. Did you try self-play, i.e. training your agent against some earlier version of itself (rough sketch after this list)? This would automatically increase the difficulty over time.
  2. Try to cover as much of the hyperparameter space as you can and keep learning for as long as you see improvements. There is really no such thing as over-training in RL :), since you just want to get your agent to perform well on the particular task.
  3. Not sure. I’m not familiar with the specific game dynamics.
  4. No, Tune does not use the RLlib evaluation results (evaluation_interval > 0). In cases like PBT, I think it uses the mean episode reward by default (but it’s best to ask in the “Ray Tune” category how exactly to set this up).
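
To make point 1 a bit more concrete, here is a rough sketch of that self-play setup, loosely based on RLlib’s self-play examples (old Trainer/callbacks API). The policy ids, the 0.8 promotion threshold and the assumption that a win shows up as a positive episode reward for `"main"` are placeholders; the config would also need a `multiagent` section with a trainable `"main"` policy and a frozen `"opponent"` policy (`policies_to_train=["main"]`).

```python
from ray.rllib.agents.callbacks import DefaultCallbacks


class SelfPlayCallback(DefaultCallbacks):
    def on_train_result(self, *, trainer, result, **kwargs):
        # Recent episode rewards collected by the trainable "main" policy.
        main_rewards = result["hist_stats"].get("policy_main_reward", [])
        if not main_rewards:
            return
        # Assumes a win corresponds to a positive episode reward for "main".
        win_rate = sum(r > 0 for r in main_rewards) / len(main_rewards)
        if win_rate > 0.8:
            # "main" clearly beats the current opponent: promote a snapshot
            # of its weights to be the new frozen opponent, then sync the
            # updated weights to the rollout workers.
            weights = trainer.get_policy("main").get_weights()
            trainer.get_policy("opponent").set_weights(weights)
            trainer.workers.sync_weights()
```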