Scaling Battle Experiments

I am trying to train a huge number of agents in the Battle environment in a decentralized manner. The experiments are extremely slow. From what I read in the docs:

  • If the environment is slow and cannot be replicated (e.g., since it requires interaction with physical systems), then you should use a sample-efficient off-policy algorithm such as DQN or SAC. These algorithms default to num_workers: 0 for single-process operation. Make sure to set num_gpus: 1 if you want to use a GPU. Consider also batch RL training with the offline data API.
  • If the environment is fast and the model is small (most models for RL are), use time-efficient algorithms such as PPO, IMPALA, or APEX. These can be scaled by increasing num_workers to add rollout workers. It may also make sense to enable vectorization for inference. Make sure to set num_gpus: 1 if you want to use a GPU. If the learner becomes a bottleneck, multiple GPUs can be used for learning by setting num_gpus > 1.
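
As a rough sketch, the two regimes quoted above might translate into config dictionaries like the ones below. The key names follow the quoted docs plus num_envs_per_worker for vectorization; all concrete values are assumptions for illustration, not recommendations.

```python
# Regime 1: slow environment that cannot be replicated.
# Pair this with a sample-efficient off-policy algorithm such as DQN or SAC.
slow_env_config = {
    "num_workers": 0,  # single-process operation, as described in the docs
    "num_gpus": 1,     # only if a GPU is available
}

# Regime 2: fast environment and small model.
# Pair this with a time-efficient algorithm such as PPO, IMPALA, or APEX.
fast_env_config = {
    "num_workers": 8,          # assumed value: number of rollout workers
    "num_envs_per_worker": 4,  # assumed value: vectorized envs per worker
    "num_gpus": 1,             # raise above 1 if the learner is the bottleneck
}

for name, cfg in [("slow", slow_env_config), ("fast", fast_env_config)]:
    print(name, cfg)
```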

I am not sure which of these categories my experiments fall into. There are 120 agents in the environment, and increasing the number of rollout workers makes matters worse: a huge amount of memory is required to hold all the policies, so they have to be offloaded to storage (RLlib does that automatically by creating 120*num_workers policy files in the location specified by policy_map_cache). On the other hand, setting num_workers to 0 and using DQN does not help either; the experiments often get stuck.
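
To make the scaling concrete, here is a back-of-the-envelope sketch of the file count estimated above and of how a per-worker policy capacity splits policies between memory and disk. The helper functions are hypothetical, the cache path is a placeholder, and the policy_map_capacity key is assumed to be available in the multiagent config of the installed RLlib version.

```python
def cached_policy_files(num_policies: int, num_workers: int) -> int:
    """File count following the 120 * num_workers estimate from the post."""
    return num_policies * num_workers


def split_by_capacity(num_policies: int, capacity: int) -> tuple:
    """Return (policies kept in memory, policies stashed to disk) per worker."""
    in_memory = min(num_policies, capacity)
    return in_memory, num_policies - in_memory


# Assumed multiagent settings; the path is a hypothetical placeholder.
multiagent_config = {
    "policy_map_capacity": 20,
    "policy_map_cache": "/tmp/policy_cache",
}

print(cached_policy_files(120, 4))  # 480
print(split_by_capacity(120, multiagent_config["policy_map_capacity"]))  # (20, 100)
```

A smaller capacity lowers the memory held per worker but should increase how often policies are swapped to and from disk, so the value is a trade-off rather than a free win.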

I am wondering if there are any ways to optimize the runs (in particular, the amount of memory needed). I have also played around with “num_sgd_iter”, “rollout_fragment_length”, “train_batch_size”, and “sgd_minibatch_size”, but I’m not sure they had any effect. Using the parallelized version of the environment did not help either. With the AEC version, 2 iterations took around 30 minutes for 120 agents and used 180 GB of memory.
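
For orientation, the four parameters above interact roughly as in the sketch below. It assumes PPO-style minibatch SGD where each trained policy makes num_sgd_iter passes over a batch of about train_batch_size samples; the helper names and the numbers are illustrative assumptions, not measurements from these runs.

```python
import math


def sgd_steps_per_iteration(train_batch_size, sgd_minibatch_size,
                            num_sgd_iter, num_trained_policies=1):
    """Approximate number of minibatch updates in one training iteration."""
    minibatches_per_pass = math.ceil(train_batch_size / sgd_minibatch_size)
    return num_sgd_iter * minibatches_per_pass * num_trained_policies


def fragments_per_worker(train_batch_size, rollout_fragment_length, num_workers):
    """Approximate rollout fragments each worker collects per train batch."""
    workers = max(num_workers, 1)  # num_workers=0 means one local worker samples
    return math.ceil(train_batch_size / (rollout_fragment_length * workers))


# Illustrative values only.
print(sgd_steps_per_iteration(4000, 128, 30))       # 960
print(sgd_steps_per_iteration(4000, 128, 30, 120))  # 115200
print(fragments_per_worker(4000, 200, 2))           # 10
```

Under these assumptions the update count grows linearly with the number of trained policies, which is one reason tuning the batch parameters alone may not change the overall picture much with 120 learning policies.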

Hey @A_M , thanks for the post. What you describe is a setup that we ourselves are currently exploring and trying to scale better with RLlib. We are working on releasing an AlphaStar-like algorithm that can handle 10-100 learning policies in a parallelized fashion on multiple (multi-) GPU machines. We expect to have this in master sometime next week to mid Feb.

Some questions for you:

  • Are all 120 policies actually learning, or are some of them “frozen” (not in the “config.multiagent.policies_to_train” list)?
  • Did you take a look at the ray/rllib/examples/ example script?
  • You could use only certain policies on certain worker indices, so that your PolicyMaps do not perform too many disk reads/writes. E.g., on the worker with self.worker_index==1, you only ever play the first 10 policies in your list, and so on. This way, if you had enough workers, you would still get a uniform distribution of policies in your trajectories.
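
The worker-index idea in the last bullet could be sketched as follows. This is a minimal illustration, not RLlib code: the policy IDs, the group size, and the helper policies_for_worker are hypothetical, and it assumes the installed RLlib version passes the rollout worker to policy_mapping_fn so that worker.worker_index can be read. A stand-in object replaces the real worker in the usage lines.

```python
import zlib
from types import SimpleNamespace

NUM_POLICIES = 120
POLICIES_PER_WORKER = 10  # assumed group size, taken from the example above
POLICY_IDS = [f"policy_{i}" for i in range(NUM_POLICIES)]  # hypothetical IDs


def policies_for_worker(worker_index: int) -> list:
    """Return the slice of policies a given worker is allowed to play."""
    # Index 0 (the local worker) is treated like worker 1 in this sketch.
    group = max(worker_index, 1) - 1
    num_groups = NUM_POLICIES // POLICIES_PER_WORKER
    start = (group % num_groups) * POLICIES_PER_WORKER
    return POLICY_IDS[start:start + POLICIES_PER_WORKER]


def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
    """Map an agent to one of the policies assigned to the current worker."""
    allowed = policies_for_worker(getattr(worker, "worker_index", 0))
    # crc32 is stable across processes, unlike Python's built-in hash() on str.
    return allowed[zlib.crc32(str(agent_id).encode()) % len(allowed)]


# Usage with a stand-in worker object.
worker_1 = SimpleNamespace(worker_index=1)
print(policies_for_worker(2)[0])                     # policy_10
print(policy_mapping_fn("red_0", worker=worker_1) in POLICY_IDS[:10])  # True
```

With 12 workers and 10 policies each, every policy is played on exactly one worker, so each worker's PolicyMap only ever needs to hold its own 10 policies.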

Note: W/o the above-mentioned AlphaStar, you won’t be able to perform parallel learning updates, so the learning step will remain slow (b/c sequential) for you right now. AlphaStar should solve that problem, though. Stay tuned!