Here’s a Team-Battle game out of Abmarl’s GridWorld: Abmarl/team_battle_example.py at main · LLNL/Abmarl · GitHub. This is actually a Dict, not a Tuple; I believe they work very similarly. Actually, I prefer the Dict because the key is descriptive. I trained a similar version of this game with RLlib and got good results.
Can you please share a link to that RLlib code so that I can understand it better?
I don’t have the exact code I used for training that use case available anymore. Here’s an example script using a different multi-agent environment. Although this one does not use a Dict action space, the approach is the same, since that detail is abstracted away in the RLlib framework.
self.action_space = spaces.Dict(
config = ppo.DEFAULT_CONFIG.copy()
config['num_workers'] = 4
config['horizon'] = timestep_limit_per_episode
The ‘episode reward mean’ is not increasing with more iterations for either algorithm. The same algorithm and environment settings converged with a discrete action space. Any suggestions on how to make it work for the Dict space above?
Hi Saurabh, thanks for using RLlib.
There are many, many reasons that may stop an RL stack from learning. We need more information about your setup and the environment to debug this.
Also, for an example of using Dict/Tuple obs and action spaces, have you checked RLlib’s examples folder? E.g.: https://github.com/ray-project/ray/blob/master/rllib/examples/nested_action_spaces.py
Hi. Thanks for responding.
@gjoliver could you please let me know what specific information is needed for debugging so that I can provide it?
I have seen the example you shared, but it is not clear how the following values were decided. Could you please help me understand it better?
"entropy_coeff": 0.00005,  # We don't want high entropy in this Env.
Sure. It would be best if you could share a reproducible script so we can see the environment and test things on our end.
About the configuration parameters, the best way is to do a hyper-parameter search.
They are usually problem dependent.
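As a concrete illustration of such a search, here is a tiny grid-search sketch using only the standard library (the parameter ranges are made up, and `train_and_eval` is a hypothetical stand-in for an actual RLlib training run):

```python
# Sketch of a tiny grid search over the parameters discussed here.
# Values are illustrative, not recommendations.
import itertools

grid = {
    "lr": [1e-5, 5e-5, 1e-4],
    "entropy_coeff": [0.0, 1e-5, 1e-4],
    "num_sgd_iter": [5, 10, 30],
}

def train_and_eval(params):
    # Hypothetical placeholder: a real version would build a trainer
    # with `params` and return the mean episode reward after N iterations.
    return 0.0

best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=train_and_eval,
)
```

In practice you would let Ray Tune manage a search like this rather than looping by hand, but the idea is the same: treat these values as tunables, not constants.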
"entropy_coeff": controls the amount of entropy loss that goes into the total loss. The higher this parameter is, the more stochastic the policy becomes.
"lr": standard learning rate.
"num_envs_per_worker": how many envs to run in a single worker. If you think your workers are underutilized, you can try tuning this parameter.
"num_sgd_iter": number of stochastic gradient descent steps we do for each batch of samples.
"num_workers": the number of rollout workers (parallel environment runners) you want to use for the trainer.
"vf_loss_coeff": similar to entropy_coeff, this controls how much value-function loss goes into the final total loss.
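Putting the coefficients together, the weighting can be sketched as a simplified view of PPO's total loss (all config values below are illustrative):

```python
# Simplified sketch of how the coefficients combine in PPO's total loss:
#   total_loss = policy_loss + vf_loss_coeff * vf_loss - entropy_coeff * entropy
# (the entropy term is subtracted so that higher entropy lowers the loss).
config = {
    "entropy_coeff": 0.00005,   # weight of the entropy bonus
    "vf_loss_coeff": 1.0,       # weight of the value-function loss
    "lr": 5e-5,                 # learning rate
    "num_sgd_iter": 10,         # SGD passes per train batch
    "num_workers": 4,           # parallel rollout workers
    "num_envs_per_worker": 1,   # envs per worker
}

def total_loss(policy_loss, vf_loss, entropy):
    return (policy_loss
            + config["vf_loss_coeff"] * vf_loss
            - config["entropy_coeff"] * entropy)
```

This also shows why raising entropy_coeff pushes the policy toward more stochastic behavior: a higher entropy value reduces the total loss, so the optimizer is rewarded for keeping the action distribution spread out.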