Scaling Battle Experiments

I am trying to train a huge number of agents in the Battle environment in a decentralized manner. The experiments are extremely slow … from what I read in the docs:

  • If the environment is slow and cannot be replicated (e.g., since it requires interaction with physical systems), then you should use a sample-efficient off-policy algorithm such as DQN or SAC. These algorithms default to num_workers: 0 for single-process operation. Make sure to set num_gpus: 1 if you want to use a GPU. Consider also batch RL training with the offline data API.
  • If the environment is fast and the model is small (most models for RL are), use time-efficient algorithms such as PPO, IMPALA, or APEX. These can be scaled by increasing num_workers to add rollout workers. It may also make sense to enable vectorization for inference. Make sure to set num_gpus: 1 if you want to use a GPU. If the learner becomes a bottleneck, multiple GPUs can be used for learning by setting num_gpus > 1 .
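In config terms, I read those two recommendations roughly as follows (a sketch using the 1.x dict-style config; the env names and the worker/GPU counts are just placeholders, not my actual setup):

```python
from ray import tune

# Slow, hard-to-replicate env: sample-efficient off-policy algo, single process.
dqn_config = {
    "env": "MySlowEnv-v0",     # placeholder env name
    "num_workers": 0,          # sample inside the trainer process itself
    "num_gpus": 1,             # one GPU for learning
}

# Fast env, small model: time-efficient algo, scaled via rollout workers.
ppo_config = {
    "env": "MyFastEnv-v0",     # placeholder env name
    "num_workers": 8,          # add rollout workers for parallel sampling
    "num_envs_per_worker": 4,  # vectorize envs for batched inference
    "num_gpus": 1,             # set > 1 if the learner becomes the bottleneck
}

tune.run("DQN", config=dqn_config)  # or: tune.run("PPO", config=ppo_config)
```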

I am not sure which one of these categories my experiments fall into … There are 120 agents in the environment, and increasing the number of rollout workers makes matters worse, since a huge amount of memory would be required to hold all the policies and offloading to storage becomes necessary … (RLlib does that automatically, creating 120 * num_workers policy files in the location specified by policy_map_cache) … On the other hand, setting num_workers to 0 and using DQN does not help either; experiments often get stuck.
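For context, the relevant part of my multiagent config looks roughly like this (key names from the 1.x multiagent config; the capacity and cache path are illustrative, and `policies` / `policy_mapping_fn` are defined elsewhere):

```python
config = {
    "num_workers": 2,
    "multiagent": {
        "policies": policies,                    # dict with all 120 PolicySpecs
        "policy_mapping_fn": policy_mapping_fn,  # defined elsewhere
        # Max number of policies each worker keeps in RAM; anything beyond
        # this gets swapped out to disk.
        "policy_map_capacity": 100,
        # Directory the swapped-out policies are written to.
        "policy_map_cache": "/tmp/policy_cache",  # illustrative path
    },
}
```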

I am wondering if there are any ways to optimize the runs (i.e., the amount of memory needed) … I have also played around with “num_sgd_iter”, “rollout_fragment_length”, “train_batch_size”, and “sgd_minibatch_size”, but I’m not sure they had any effect. Using the parallelized version of the environment did not help either … With the AEC version, 2 iterations took around 30 minutes for 120 agents and roughly 180 GB of memory.
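For reference, these are the PPO knobs I experimented with and how I understand they interact (1.x key names; the values below are illustrative, not my actual settings):

```python
# Each worker collects rollout_fragment_length steps per env per sample() call,
# and the trainer accumulates samples until train_batch_size timesteps are
# reached; the learner then runs num_sgd_iter passes over that batch in chunks
# of sgd_minibatch_size.
ppo_tuning = {
    "rollout_fragment_length": 200,  # shorter fragments -> smaller per-worker sample buffers
    "train_batch_size": 4000,        # total timesteps gathered per training iteration
    "sgd_minibatch_size": 128,       # smaller minibatches -> lower peak GPU memory
    "num_sgd_iter": 10,              # fewer SGD passes -> faster (but noisier) updates
}
```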

Hey @A_M, thanks for the post. What you describe is a setup that we ourselves are currently exploring and trying to scale better with RLlib. We are working on releasing an AlphaStar-like algorithm that can handle 10-100 learning policies in a parallelized fashion on multiple (multi-) GPU machines. We expect to have this in master sometime next week to mid Feb.

Some questions for you:

  • Are all 120 of these policies actually learning, or are some of them “frozen” (i.e., not in the “config.multiagent.policies_to_train” list)?
  • Did you take a look at the ray/rllib/examples/self_play_league_based_with_open_spiel.py example script?
  • You could also make sure to only use certain policies on certain worker indices, to keep your PolicyMaps from performing too many disk reads/writes. E.g., on the worker with self.worker_index == 1, you only ever play the first 10 policies in your list, etc. (see the sketch after this list). This way, if you had enough workers, you would still get a uniform distribution of policies in your trajectories.
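A rough sketch of that last idea (assuming the policy_mapping_fn signature that also receives the RolloutWorker; the policy names, the slice size of 10, and the policies_to_train entries are purely illustrative):

```python
import random

POLICIES_PER_WORKER = 10
NUM_POLICIES = 120

def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    # worker_index is 1-based for remote rollout workers (0 = local worker).
    idx = max(worker.worker_index - 1, 0)
    # Each worker only ever samples from its own slice of 10 policies, so its
    # PolicyMap never has to swap the other 110 policies in from disk.
    start = (idx * POLICIES_PER_WORKER) % NUM_POLICIES
    return "policy_{}".format(start + random.randrange(POLICIES_PER_WORKER))

config = {
    "multiagent": {
        "policies": policies,                   # all 120 PolicySpecs (defined elsewhere)
        "policy_mapping_fn": policy_mapping_fn,
        # Only these keep learning; everything else stays frozen.
        "policies_to_train": ["policy_0", "policy_1"],
        # Each worker only needs its own slice in memory.
        "policy_map_capacity": POLICIES_PER_WORKER,
    },
}
```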

Note: Without the above-mentioned AlphaStar implementation, you won’t be able to perform parallel learning updates, so the learning step will remain slow (because sequential) for you right now. AlphaStar should solve that problem, though. Stay tuned 🙂


We are working on releasing an AlphaStar-like algorithm that can handle 10-100 learning policies in a parallelized fashion on multiple (multi-) GPU machines.

@sven1977 Sorry for the late reply, but do you have any more information on this? I’ve been working on an RLlib-based replication of AlphaStar, and it’d be very relevant to what I’m ultimately trying to do.

Ray RLlib has been actively developing support for AlphaStar-like multi-policy, multi-GPU training. As of early 2022, the RLlib team was working on an algorithm capable of handling 10–100 learning policies in parallel across multiple GPUs, with an expected release in master “sometime next week to mid Feb” (2022). This would enable parallel learning updates for large-scale multi-agent setups, which was not possible in the then-current RLlib (learning was sequential) but would be addressed by the new AlphaStar-style implementation.

If you need an example of league-based self-play and policy management (as in AlphaStar), RLlib provides a runnable script: self_play_league_based_with_open_spiel.py. This script demonstrates league-based training, policy freezing, and dynamic matchmaking, which are core AlphaStar concepts.


