Slow Trainable initialization with many policies

I’m running multi-agent experiments with a large number of policies (currently 10 to 100).
I noticed that with many policies, e.g. 30, initialization of my trainable is very slow (more than 5 minutes). I am running this on what I think is good hardware (128 cores, 2.4 GHz). I set num_workers to 50, but I’m not sure whether this has an effect on init.

Are there any obvious pitfalls that I am missing?
Is this expected? I’m surprised having many policies has such a large effect on initialization, I assumed it would amount to simply drawing more random weights.
Is there any way to speed this up?

I have created this very minimal example which illustrates the problem on both DQN and PPO.

import time

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.tune import register_env


num_policies = 30

# Simple environment with 30 independent CartPole agents.
# NOTE: the registered env name was lost from the post; "multi_cartpole" is a placeholder.
register_env("multi_cartpole",
             lambda _: MultiAgentCartPole({"num_agents": 30}))
single_dummy_env = gym.make("CartPole-v0")
obs_space = single_dummy_env.observation_space
act_space = single_dummy_env.action_space

policies = {str(i): (None, obs_space, act_space, {}) for i in range(num_policies)}

policy_mapping_fn = str

start = time.time()
ppo_trainer = PPOTrainer(
    env="multi_cartpole",
    config={
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": None,  # All
        },
        "num_workers": 50,  # as described above
    })

print(f"PPO init took {time.time() - start} seconds")
start = time.time()
dqn_trainer = DQNTrainer(
    env="multi_cartpole",
    config={
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": None,  # All
        },
        "num_workers": 50,  # as described above
    })

print(f"DQN init took {time.time() - start} seconds")

Here is part of an example output:

2021-09-06 09:11:46,320	INFO -- Trainable.setup took 476.444 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-09-06 09:11:46,321	WARNING -- Install gputil for GPU system monitoring.
PPO init took 476.4660232067108 seconds
2021-09-06 09:11:46,652	WARNING -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
2021-09-06 09:23:35,170	INFO -- Trainable.setup took 708.833 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-09-06 09:23:35,171	WARNING -- Install gputil for GPU system monitoring.
DQN init took 708.8500621318817 seconds

Hey @PavelC, great question and interesting find. I can’t really reproduce these extreme numbers. On my Mac, it takes roughly two minutes or less per Trainer (PPO and DQN) to create the 30 policies:

PPO init took 96.228423833847046 seconds
DQN init took 125.47070574760437 seconds

Note that this is with your example above and 10 workers (not 50!), but it is still significantly less than 5 minutes. I’m not sure what’s going on. Keep in mind that in the tf + multi-agent case, RLlib creates a separate graph and session for each policy so that policies can be added (and removed) on the fly.

Just to try this hypothesis: You may get faster Trainer build results when using torch as no static graph has to be constructed for each policy.
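Trying this only requires changing the framework key of the Trainer config. A minimal sketch, assuming the RLlib 1.x dict-style config used in the script above; `build_config` is a hypothetical helper (not part of RLlib) and the placeholder policy spec is only there to make the snippet run on its own:

```python
import copy


def build_config(multiagent, framework="tf", num_workers=10):
    """Return an RLlib 1.x style config dict (hypothetical helper).

    `framework` is the config key that selects the backend,
    for example "tf" or "torch".
    """
    return {
        "framework": framework,
        "num_workers": num_workers,
        # Deep-copy so each variant owns its own multiagent settings.
        "multiagent": copy.deepcopy(multiagent),
    }


# Placeholder multiagent section; in the real script this holds the 30 policies.
multiagent = {"policies": {"0": (None, None, None, {})}}

tf_config = build_config(multiagent, framework="tf")
torch_config = build_config(multiagent, framework="torch")
print(tf_config["framework"], torch_config["framework"])
```

Each dict could then be passed as `config=` to the Trainer constructor and timed separately, so the two backends are compared with otherwise identical settings.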

I’ve experienced slow initialization primarily as a function of the number of workers, not so much the number of policies. @PavelC have you tried setting up with fewer workers?

Sorry for the late reply, I had some work to do with smaller numbers of policies.

Thank you for the responses @sven1977 and @rusu24edward , I think I see now that it is not unreasonable for initialization time to increase about linearly with more policies and workers if this affects the number of separate graphs that are created.

Now I ran some experiments:

| num_policies | num_workers | init time (s) | machine |
|---|---|---|---|
| 10 | 10 | 80 | a |
| 10 | 50 | 98 | a |
| 30 | 10 | 688 | a |
| 30 | 50 | 878 | a |
| 30 | 10 | 43 | b |
| 30 | 50 | 58 | b |

As you can see, even with fewer workers I get huge init times on machine a. When num_policies is low, a large number of workers does not seem to affect the init time much.

The third row should be the same setup you ran on your Mac, but my init time is far larger. Machine a has more than 100 cores, which I assume your Mac doesn’t have.

I then ran the same experiment on a different machine that has fewer but more powerful CPUs and got the results marked as machine b in the table. Init times are very reasonable there.

My guess would be that the problem is that initialization is actually not done in parallel, so a machine with fewer stronger cores will take less time to initialize than a machine which has hundreds of free cores that are relatively old.

These new experiments were run with ray 1.7

In case someone wants to run this again, I made slight changes to the original test script:

Ok, even with the stronger CPU at a certain threshold initialization takes much longer again:

| num_policies | num_workers | seconds (tf) | seconds (tfe) |
|---|---|---|---|
| 30 | 10 | 43 | |
| 50 | 10 | 71 | |
| 100 | 10 | 139 | |
| 110 | 10 | 641 | 47 |
| 125 | 10 | 791 | |
| 150 | 10 | 999 | |
| 200 | 10 | 1385 | 99 |

The important part is the large jump in initialization time from 100 to 110 policies. Not sure what’s going on there. I actually see more than 1 CPU being used. Could this be a cache problem, where at a certain size things don’t fit into the cache anymore and main memory has to be used?

As you can also see from the table, using tfe seems to solve the problem, which makes sense.
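The jump in the tf column can be made explicit by looking at the cost of each additional policy between consecutive measurements. A small sketch using only the tf timings reported above (`marginal_costs` is an illustrative helper name):

```python
# (num_policies, init seconds with framework "tf"), copied from the timings above.
timings = [(30, 43), (50, 71), (100, 139), (110, 641),
           (125, 791), (150, 999), (200, 1385)]


def marginal_costs(rows):
    """Seconds per additional policy between consecutive measurements."""
    return [
        (n0, n1, (t1 - t0) / (n1 - n0))
        for (n0, t0), (n1, t1) in zip(rows, rows[1:])
    ]


for n0, n1, cost in marginal_costs(timings):
    print(f"{n0:>3} -> {n1:>3} policies: {cost:5.1f} s per extra policy")
```

Up to 100 policies each extra policy costs about 1.4 s. Between 100 and 110 it is about 50 s per policy, and beyond that roughly 8 to 10 s, so the behaviour changes right around 100 policies rather than degrading gradually.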

Ok, sorry for all the spam, but I found another way to circumvent this problem while still using framework: tf as opposed to tfe.
When we init the policies sequentially instead of all at once, the problem seems to at least be reduced.
I.e. I did this:

    for i in range(1, num_policies):
        _ = ppo_trainer.add_policy(

instead of putting all policies into the config from the start.
(see also the updated gist)
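A rough sketch of the sequential approach, for readers without access to the gist. It assumes the trainer exposes an RLlib 1.x style `add_policy(policy_id, policy_cls, observation_space=..., action_space=..., config=...)` method; `add_policies_sequentially` is a hypothetical helper, and `RecordingTrainer` is only a stand-in so the snippet runs without Ray:

```python
def add_policies_sequentially(trainer, policy_cls, policy_ids,
                              obs_space, act_space, policy_config=None):
    """Add policies to an already built trainer, one at a time."""
    added = []
    for policy_id in policy_ids:
        # Assumed RLlib 1.x style call; check the signature of your Ray version.
        trainer.add_policy(
            policy_id=policy_id,
            policy_cls=policy_cls,
            observation_space=obs_space,
            action_space=act_space,
            config=policy_config or {},
        )
        added.append(policy_id)
    return added


class RecordingTrainer:
    """Stand-in that only records calls; not an RLlib class."""

    def __init__(self):
        self.added = []

    def add_policy(self, policy_id, policy_cls, *, observation_space=None,
                   action_space=None, config=None):
        self.added.append(policy_id)


trainer = RecordingTrainer()
ids = [str(i) for i in range(1, 5)]
result = add_policies_sequentially(trainer, object, ids, None, None)
print(result)
```

With the script from the first post, the real trainer would be built with only policy "0" in the config, and the helper would then be called with `ppo_trainer`, the policy class of that first policy, and the remaining ids `str(1)` to `str(num_policies - 1)`.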

Here are my recorded timings for this:

| num_policies | init sequentially | init all at once |
|---|---|---|
| 100 | 144s | 139s |
| 110 | 183s | 641s |
| 200 | 425s | 1385s |
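To put these numbers in proportion, the ratio between the two strategies can be computed directly from the timings above (a small sketch, `ratio` is an illustrative name):

```python
# num_policies -> (sequential init seconds, all-at-once init seconds), from above.
timings = {100: (144, 139), 110: (183, 641), 200: (425, 1385)}


def ratio(sequential, all_at_once):
    """How many times longer the all-at-once init takes than the sequential one."""
    return all_at_once / sequential


for n, (seq, batch) in timings.items():
    print(f"{n} policies: all-at-once / sequential = {ratio(seq, batch):.2f}")
```

At 100 policies the two strategies are essentially equal (ratio just under 1), while at 110 and 200 policies the all-at-once init takes more than three times as long.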

I have found the reason why increasing the number of policies beyond 100 causes a slowdown in initialization (as well as a 10x slowdown when actually running the training):
Be sure to set the config parameter

"multiagent": {
    "policy_map_capacity": 100,
},

to an appropriate value, i.e. higher than the number of policies, if these policies are all frequently used. The default value is 100, which caused problems in my case.
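Why a capacity just below the number of policies hurts so much can be illustrated with a toy model. The sketch below assumes the policy map behaves like a least-recently-used cache of size policy_map_capacity; `ToyPolicyCache` and `count_loads` are hypothetical names and this is not RLlib's implementation, it only counts how often a policy would have to be built or loaded again:

```python
from collections import OrderedDict


class ToyPolicyCache:
    """Toy LRU cache standing in for a capacity-limited policy map."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.loads = 0

    def get(self, policy_id):
        if policy_id in self._items:
            # Cache hit: mark the policy as most recently used.
            self._items.move_to_end(policy_id)
        else:
            # Cache miss: the policy has to be built or loaded (again).
            self.loads += 1
            self._items[policy_id] = object()
            if len(self._items) > self.capacity:
                # Evict the least recently used policy.
                self._items.popitem(last=False)
        return self._items[policy_id]


def count_loads(num_policies, capacity, rounds=10):
    """Touch every policy once per round and count the resulting loads."""
    cache = ToyPolicyCache(capacity)
    for _ in range(rounds):
        for i in range(num_policies):
            cache.get(str(i))
    return cache.loads


print(count_loads(100, 100))  # 100: each policy is loaded exactly once
print(count_loads(110, 100))  # 1100: every single access is a miss
```

When all policies are used in turn, exceeding the capacity by only 10 policies turns every access into a miss in this model, which is in line with the sharp jump between 100 and 110 policies reported above.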