Slow Trainable initialization with many policies

I’m running multi-agent experiments with large numbers of policies (currently 10–100).
I noticed that with many policies, e.g. 30, initialization of my trainable is very slow (more than 5 minutes). I am running this on what I think is good hardware (128 cores at 2.4 GHz; I set num_workers to 50, but I’m not sure whether that affects init).

Are there any obvious pitfalls that I am missing?
Is this expected? I’m surprised that having many policies has such a large effect on initialization; I assumed it would amount to simply drawing more random weights.
Is there any way to speed this up?

I have created this very minimal example which illustrates the problem on both DQN and PPO.

import time

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.tune import register_env

ray.init()

num_policies = 30

# Simple environment with num_policies independent cartpole entities
register_env("multi_agent_cartpole",
             lambda _: MultiAgentCartPole({"num_agents": num_policies}))
single_dummy_env = gym.make("CartPole-v0")
obs_space = single_dummy_env.observation_space
act_space = single_dummy_env.action_space

policies = {str(i): (None, obs_space, act_space, {}) for i in range(num_policies)}

policy_mapping_fn = str  # agent index i -> policy id str(i)


start = time.time()
ppo_trainer = PPOTrainer(
    env="multi_agent_cartpole",
    config={
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": None,  # All
        }
    })

print(f"PPO init took {time.time() - start} seconds")
start = time.time()
dqn_trainer = DQNTrainer(
    env="multi_agent_cartpole",
    config={
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": None,
        }
    })

print(f"DQN init took {time.time() - start} seconds")

Here is part of an example output:

2021-09-06 09:11:46,320	INFO trainable.py:109 -- Trainable.setup took 476.444 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-09-06 09:11:46,321	WARNING util.py:55 -- Install gputil for GPU system monitoring.
PPO init took 476.4660232067108 seconds
2021-09-06 09:11:46,652	WARNING deprecation.py:34 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
2021-09-06 09:23:35,170	INFO trainable.py:109 -- Trainable.setup took 708.833 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-09-06 09:23:35,171	WARNING util.py:55 -- Install gputil for GPU system monitoring.
DQN init took 708.8500621318817 seconds

Hey @PavelC , great question and interesting find. I can’t really reproduce these extreme numbers. On my Mac, it takes roughly 1.5–2 minutes per Trainer (PPO and DQN) to create the 30 policies:

PPO init took 96.228423833847046 seconds
DQN init took 125.47070574760437 seconds

But this is with your above example and 10 workers (not 50!). Still, significantly less than 5 minutes. Not sure what’s going on. Note that RLlib creates a separate graph and session for each policy in the tf multi-agent case, so that policies can be added (and removed) on the fly.

Just to test this hypothesis: you may get faster Trainer build times when using torch, since no static graph has to be constructed for each policy.
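Trying this only requires changing the framework key in the config (a sketch reusing the `policies` and `policy_mapping_fn` variables from the example script above):

```python
ppo_trainer = PPOTrainer(
    env="multi_agent_cartpole",
    config={
        "framework": "torch",  # torch policies: no static tf graph built per policy
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": None,
        },
    })
```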

I’ve experienced slow initialization primarily as a function of the number of workers, not so much the number of policies. @PavelC have you tried setting up with fewer workers?

Sorry for the late reply, I had some work to do with fewer policies.

Thank you for the responses @sven1977 and @rusu24edward . I see now that it is not unreasonable for initialization time to increase roughly linearly with the number of policies and workers, if that determines the number of separate graphs that are created.

Now I ran some experiments:

num_policies  num_workers  init time (s)  machine
10            10           80             a
10            50           98             a
30            10           688            a
30            50           878            a
30            10           43             b
30            50           58             b

As you can see, even if I reduce the number of workers I still get huge init times. With a low num_policies, a large number of workers does not seem to affect init time much.

So the third row should match what you ran on your Mac, but I get far larger init times. The machine I used has more than 100 cores, which I assume your Mac doesn’t.

Then I ran the same again on a different machine that has fewer but more powerful CPUs and got the results marked as machine b in the table. Init times are very reasonable there.

My guess is that initialization is actually not done in parallel, so a machine with fewer, stronger cores will initialize faster than one with hundreds of free but relatively old cores.

These new experiments were run with Ray 1.7.

In case someone wants to run this again, I made slight changes to the original test script:
slow_rllib_init.py

Ok, even with the stronger CPU at a certain threshold initialization takes much longer again:

num_policies  num_workers  seconds (tf)  seconds (tfe)
30            10           43
50            10           71
100           10           139
110           10           641           47
125           10           791
150           10           999
200           10           1385          99

The important part is the large jump in initialization time from 100 to 110 policies. Not sure what’s going on there; I actually see more than one CPU being used. Could this be a cache problem, where past a certain size things no longer fit into the cache and main memory has to be used?

As the table also shows, using tfe seems to avoid the problem, which makes sense.
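Switching to eager mode is again just a one-key config change (a sketch based on the example script; the `tf` column in the table corresponds to the default `framework: "tf"`):

```python
config={
    "framework": "tfe",  # tf eager execution: no static graph built per policy
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        "policies_to_train": None,
    },
}
```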

Ok, sorry for all the spam, but I found another way to work around this problem while still using framework: tf (as opposed to tfe).
When the policies are initialized sequentially instead of all at once, the problem is at least reduced.
I.e. I did this:

# Start the trainer with only policy "0" in the config, then add the rest one by one.
for i in range(1, num_policies):
    ppo_trainer.add_policy(
        policy_id=str(i),
        policy_cls=type(ppo_trainer.get_policy("0")),
    )

instead of putting all the policies into the config from the start.
(see also the updated gist)

Here are my recorded timings for this:

num_policies  init sequentially  init all at once
100           144s               139s
110           183s               641s
200           425s               1385s

I have found the reason why increasing the number of policies beyond 100 slows down initialization (and also slows down actual training by about 10x):
Be sure to set the config parameter

"multiagent": {
    'policy_map_capacity': 100
}

to an appropriate value, i.e. higher than the number of policies, if all of these policies are frequently used. The default value is 100, which caused the problems in my case.
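To see why a capacity below the policy count hurts so much, here is a toy sketch (the class and method names are mine, not RLlib's actual PolicyMap API) of a capacity-bounded LRU map. With round-robin access to more policies than the capacity, every single access becomes a cache miss that forces an eviction; in RLlib, an eviction means stashing the evicted policy's state, which is expensive:

```python
from collections import OrderedDict

class TinyPolicyMap:
    """Toy model of a capacity-bounded policy cache (hypothetical names)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.evictions = 0

    def access(self, policy_id):
        if policy_id in self.cache:
            self.cache.move_to_end(policy_id)  # mark as most recently used
            return
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
            self.evictions += 1                # the expensive step in real RLlib
        self.cache[policy_id] = object()       # stand-in for a policy object

# 110 policies, capacity 100: round-robin access thrashes the cache.
m = TinyPolicyMap(capacity=100)
for _ in range(3):                             # 3 "training iterations"
    for pid in range(110):
        m.access(pid)
print(m.evictions)   # 230: after warm-up, every access misses

# Capacity above the number of policies: no thrashing at all.
m2 = TinyPolicyMap(capacity=128)
for _ in range(3):
    for pid in range(110):
        m2.access(pid)
print(m2.evictions)  # 0
```

This matches the observed cliff: 100 policies fit the default capacity of 100 exactly, while 110 policies push every lookup into the slow eviction path.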