Hi, I’ve been working on getting RLlib to train bots for my custom PettingZoo environment (elliottower/gobblet-rl on GitHub: an interactive multi-agent reinforcement learning environment for the board game Gobblet, built with PettingZoo), but I’ve been having a lot of trouble getting even a basic RLlib algorithm to learn. I’d appreciate any pointers on where to start or tips for debugging.
I started with the self_play_with_open_spiel example because it seemed like a good benchmark: the agent should be able to beat previous versions of itself, or at least a random agent. So far, though, I can’t get it to win even against a random agent. I started from that example’s hyperparameters and changed them a bit, without much luck. The documentation says ray.tune can run hyperparameter sweeps, but I’m not sure how that fits into this example, since it already calls Tuner() with a PPO object and doesn’t pass in an objective (as far as I can see).
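To make concrete what I mean by a sweep with an objective, here is a minimal plain-Python sketch of a grid sweep. It is not Ray Tune code: `fake_win_rate` is a hypothetical stand-in for "train with this config, then measure the win rate against a random agent", and the `lr` and `entropy_coeff` values are arbitrary.

```python
import math
from itertools import product


def fake_win_rate(config):
    """Hypothetical objective: a toy, deterministic score in [0, 1].

    In a real run this would train an agent with `config` and return the
    fraction of games won against a random opponent.
    """
    lr_penalty = 0.2 * abs(math.log10(config["lr"]) + 4)
    entropy_penalty = 10 * abs(config["entropy_coeff"] - 0.01)
    return max(0.0, 1.0 - lr_penalty - entropy_penalty)


def grid_sweep(search_space, objective):
    """Evaluate every combination in the grid and return the best one."""
    keys = list(search_space)
    results = []
    for values in product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append((objective(config), config))
    best_score, best_config = max(results, key=lambda r: r[0])
    return best_config, best_score, results


# Arbitrary example grid; these values are assumptions, not recommendations.
search_space = {
    "lr": [1e-3, 1e-4, 1e-5],
    "entropy_coeff": [0.0, 0.01, 0.05],
}

best_config, best_score, results = grid_sweep(search_space, fake_win_rate)
print(best_config, best_score)
```

What I can’t work out is where the equivalent of `objective` goes in the self-play example, i.e. which reported metric the trials would be compared on.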
My understanding is that RLlib doesn’t do any action masking by default, so you have to use a wrapper model like ActionMaskModel. I tried that as a second step, but following that example I ran into errors even on the new release, ray 2.3.0: the PPO and APPO objects (the only two algorithms that example works with) don’t have the attribute _warmup_time. I want to put this in my own repo and would like to depend on the official ray release rather than a nightly, but maybe bugs like this are already fixed in the nightlies.
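For reference, this is roughly what I understand action masking to do, as a minimal plain-Python sketch rather than RLlib’s actual implementation: illegal actions get a logit of negative infinity before the softmax, so they end up with zero probability. The function name `masked_softmax` is my own.

```python
import math


def masked_softmax(logits, action_mask):
    """Turn logits into probabilities, giving illegal actions probability 0.

    `action_mask[i]` is 1 (or True) if action i is legal, else 0 (or False).
    """
    if len(logits) != len(action_mask):
        raise ValueError("logits and action_mask must have the same length")
    if not any(action_mask):
        raise ValueError("at least one action must be legal")

    # Replace the logits of illegal actions with -inf so exp() maps them to 0.
    masked = [
        logit if legal else float("-inf")
        for logit, legal in zip(logits, action_mask)
    ]

    # Subtract the max legal logit for numerical stability.
    top = max(masked)
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Three actions, the middle one illegal.
probs = masked_softmax([1.0, 2.0, 3.0], [1, 0, 1])
print(probs)
```

In my environment the mask would have 54 entries, one per action.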
For a little info about the game: the action space is Discrete(54), on a 3x3 board with three sizes of pieces, which can ‘gobble’ each other or move. I have a greedy agent that does a very basic two-step tree search: it checks whether it can win this turn, block the enemy from winning next turn, or make a move that lets it win next turn no matter what the enemy does. I’m fairly new to RL as a field, so I wasn’t sure how to implement MCTS or tree search in general. My greedy agent uses an actual board object and runs a sort of simulation, but that seems like cheating.
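To show the idea behind the greedy agent, here is a simplified sketch that uses plain tic-tac-toe as a stand-in (no piece sizes, gobbling, or moving pieces), so it is not the code from my repo. The board is a list of 9 cells holding 1, -1, or 0 for empty, and all function names are just for illustration.

```python
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 1 or -1 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == 0]


def play(board, move, player):
    """Return a copy of the board with `player` placed at `move`."""
    new_board = list(board)
    new_board[move] = player
    return new_board


def wins_next_turn(board, player):
    """True if `player` has at least one immediately winning move."""
    return any(winner(play(board, m, player)) == player
               for m in legal_moves(board))


def greedy_move(board, player):
    opponent = -player
    moves = legal_moves(board)

    # 1. Win this turn if possible.
    for m in moves:
        if winner(play(board, m, player)) == player:
            return m

    # 2. Otherwise take the square the opponent would win with next turn.
    for m in moves:
        if winner(play(board, m, opponent)) == opponent:
            return m

    # 3. Otherwise look for a move after which every opponent reply
    #    still leaves us a winning move (and does not win for them).
    for m in moves:
        after = play(board, m, player)
        replies = legal_moves(after)
        if replies and all(
            winner(play(after, r, opponent)) != opponent
            and wins_next_turn(play(after, r, opponent), player)
            for r in replies
        ):
            return m

    # 4. Fallback: first legal move (a real agent might pick at random).
    return moves[0]


# Player 1 holds cells 0 and 4; player -1 holds cells 1 and 8.
move = greedy_move([1, -1, 0, 0, 1, 0, 0, 0, -1], player=1)
print(move)
```

The point is that every step calls `play` on a copy of the real board, which is the simulation part I was unsure about.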