Slow Trainable initialization with many policies

I’m running multi-agent experiments with a large number of policies (currently 10 - 100).
I noticed that with many policies, e.g. 30, initialization of my trainable is very slow (more than 5 minutes). I am running this on what I think is good hardware (128 cores at 2.4 GHz); I set num_workers to 50, but I’m not sure whether that has any effect on initialization.
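For reference, in the full experiments the worker count is just an extra key in the same trainer config dict. This is only a sketch of what I pass; the minimal example below omits it and uses the default:

config = {
    "num_workers": 50,  # number of rollout worker actors; unclear to me whether this affects setup time
    "multiagent": {
        # same policies / policy_mapping_fn / policies_to_train as in the example below
    },
}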

Are there any obvious pitfalls that I am missing?
Is this expected? I’m surprised that having many policies has such a large effect on initialization; I assumed it would amount to simply drawing more random weights.
Is there any way to speed this up?

I have created this minimal example, which illustrates the problem for both PPO and DQN.

import time

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.tune import register_env

ray.init()

num_policies = 30

# Simple environment with num_policies independent cartpole agents
register_env("multi_agent_cartpole",
             lambda _: MultiAgentCartPole({"num_agents": num_policies}))
single_dummy_env = gym.make("CartPole-v0")
obs_space = single_dummy_env.observation_space
act_space = single_dummy_env.action_space

# One spec per policy: (policy_cls or None for the trainer default, obs_space, act_space, config)
policies = {str(i): (None, obs_space, act_space, {}) for i in range(num_policies)}

policy_mapping_fn = str  # agent index i is mapped to the policy named str(i)


start = time.time()
ppo_trainer = PPOTrainer(
    env="multi_agent_cartpole",
    config={
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": None,  # All
        }
    })

print(f"PPO init took {time.time() - start} seconds")
start = time.time()
dqn_trainer = DQNTrainer(
    env="multi_agent_cartpole",
    config={
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": None,
        }
    })

print(f"DQN init took {time.time() - start} seconds")

Here is part of an example output:

2021-09-06 09:11:46,320	INFO trainable.py:109 -- Trainable.setup took 476.444 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-09-06 09:11:46,321	WARNING util.py:55 -- Install gputil for GPU system monitoring.
PPO init took 476.4660232067108 seconds
2021-09-06 09:11:46,652	WARNING deprecation.py:34 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
2021-09-06 09:23:35,170	INFO trainable.py:109 -- Trainable.setup took 708.833 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-09-06 09:23:35,171	WARNING util.py:55 -- Install gputil for GPU system monitoring.
DQN init took 708.8500621318817 seconds
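In case it helps with diagnosing this, the next thing I plan to do is a small sweep over the number of policies to see how setup time scales. This is only a sketch that reuses the imports and spaces from the example above; I haven't included numbers because I haven't run the full sweep yet:

# Time PPOTrainer construction for increasing policy counts,
# reusing time, register_env, MultiAgentCartPole, PPOTrainer, obs_space and act_space from above.
for n in (1, 5, 10, 30):
    env_name = f"multi_agent_cartpole_{n}"
    register_env(env_name,
                 lambda _, n=n: MultiAgentCartPole({"num_agents": n}))
    policies_n = {str(i): (None, obs_space, act_space, {}) for i in range(n)}
    start = time.time()
    trainer = PPOTrainer(
        env=env_name,
        config={
            "multiagent": {
                "policies": policies_n,
                "policy_mapping_fn": str,
                "policies_to_train": None,
            }
        })
    print(f"{n} policies: init took {time.time() - start:.1f} seconds")
    trainer.stop()  # shut down this trainer's actors before building the next one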