Restoring the NN after training in a multi-agent environment

Hi everyone, I am trying to obtain the network I trained. I tried the following code for the single-agent scenario:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.logger import pretty_print
from gym_pybullet_drones.envs.multi_agent_rl.CustomBaseMAA3 import CustomRl3
import os
import ray
from ray.rllib.policy.policy import PolicySpec

ray.init(num_cpus=5)
temp_env = CustomRl3()
pol = PolicySpec()
policies = {"policy_1": pol}
policy_ids = list(policies.keys())

def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    return "policy_1"

# Single-agent setup: no .multi_agent() block, so the default policy is used
algo = (
    PPOConfig()
    .rollouts(num_rollout_workers=1)
    .environment(env=CustomRl3)
    .framework("torch")
    .training(num_sgd_iter=5)
    .resources(num_gpus=int(os.environ.get("RLLIB_NUM_GPUS", "0")))
    .build()
)

print(algo.get_policy().get_weights())

ray.shutdown()

This works well, but when I try to extract the network in the multi-agent scenario it fails.

The code I use in the multi-agent scenario is the following:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.logger import pretty_print
from gym_pybullet_drones.envs.multi_agent_rl.CustomBaseMAA3 import CustomRl3
import os
import ray
from ray.rllib.policy.policy import PolicySpec

ray.init(num_cpus=5)
temp_env = CustomRl3()
pol = PolicySpec()
policies = {"policy_1": pol}
policy_ids = list(policies.keys())

def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    return "policy_1"

# Multi-agent setup: every agent is mapped to the single policy "policy_1"
algo = (
    PPOConfig()
    .rollouts(num_rollout_workers=1)
    .multi_agent(policies=policies, policy_mapping_fn=policy_mapping_fn)
    .environment(env=CustomRl3)
    .framework("torch")
    .training(num_sgd_iter=5)
    .resources(num_gpus=int(os.environ.get("RLLIB_NUM_GPUS", "0")))
    .build()
)

print(algo.get_policy().get_weights())
ray.shutdown()

In the multi-agent scenario the function get_policy() returns None. In this blog post, it is suggested to use PPOTrainer, but I get an error when I try to use it. Is PPOTrainer still supported?
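For reference, this is the call I would expect to return the trained policy in the multi-agent case (assuming the policy now has to be looked up by the id it was registered under; I am not sure this is the intended API):

# Look up the policy by the id used in .multi_agent(); without an id,
# get_policy() looks for "default_policy", which does not exist here.
policy_1 = algo.get_policy("policy_1")
print(policy_1.get_weights())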

Thanks!

I can get the weights from a checkpoint using the following code:

from ray.rllib.policy.policy import Policy

checkpoint_path = path_to_checkpoint  # path to the saved policy checkpoint
policy = Policy.from_checkpoint(checkpoint=checkpoint_path)
wghts = policy.get_weights()
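Since get_weights() appears to return a plain dict of arrays, the full layer structure can be listed first:

# Print every parameter name in the restored policy together with its shape
for name, value in wghts.items():
    print(name, value.shape)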

I configured 2 hidden layers with 64 neurons each.

When I print the shape of the weights I get the following output:

print(wghts['encoder.actor_encoder.net.mlp.0.weight'].shape)
print(wghts['encoder.actor_encoder.net.mlp.2.weight'].shape)
print(wghts['pi.net.mlp.0.weight'].shape)

(64, 13)
(64, 64)
(8, 64)

The observation space is Box([-1. -1. -1. -1. -1. -1. -1. -1. -1. 0. 0. 0. 0.], [1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 1. 1. 2.], (13,), float32) and the action space is Box(-1.0, 1.0, (4,), float32).

Shouldn't the shape of wghts['pi.net.mlp.0.weight'] be (4, 64) according to the action space?

Hi @sAz-G,

If you have a continuous action space, then the size of the policy output will be num_actions*2. Half of the outputs specify the mean and the other half the standard deviation of your action distribution, which is probably a diagonal Gaussian.
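Roughly, the 8 outputs map to a 4-dimensional action like this (only a sketch; whether the second half is the standard deviation itself or its log depends on the action distribution RLlib uses, so verify that):

import torch

policy_out = torch.randn(8)                    # placeholder for the 8 raw outputs of the pi head
mean, spread = policy_out.chunk(2)             # first 4 = mean, last 4 = spread parameter
std = spread.exp()                             # if the spread is a log-std, exponentiating makes it positive
dist = torch.distributions.Normal(mean, std)   # diagonal Gaussian over the 4 action dims
action = dist.sample()                         # one sampled 4-dimensional action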

  1. When I train the model, the actions passed to the step function, step(action), have only four elements. Shouldn't they also contain num_actions*2 elements? Or are the actions passed to the step function already drawn from the distribution?

  2. I also tried to compute actions (based on the current observation) using the function policy.compute_single_action(observation). The last 4 elements out of the 8 are sometimes negative. If the last 4 are the standard deviation, shouldn't they be non-negative?

  3. I need to use the weights in another program, but I am a bit confused about how to get the trained neural network. Do I need the weights in wghts['encoder.actor_encoder.net.mlp.0.weight'], wghts['encoder.actor_encoder.net.mlp.2.weight'], and wghts['pi.net.mlp.0.weight'], plus their respective biases, to reconstruct the network? (See the sketch below.)
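For concreteness, this is the kind of reconstruction I have in mind. It is only a sketch: I am assuming tanh activations (RLlib's default fcnet_activation), guessing the bias key names from the weight names, and taking the first half of the output as a deterministic action.

import numpy as np

def actor_forward(obs, w):
    # 13 -> 64 -> 64 -> 8, mirroring the shapes printed above
    x = np.tanh(w['encoder.actor_encoder.net.mlp.0.weight'] @ obs + w['encoder.actor_encoder.net.mlp.0.bias'])
    x = np.tanh(w['encoder.actor_encoder.net.mlp.2.weight'] @ x + w['encoder.actor_encoder.net.mlp.2.bias'])
    out = w['pi.net.mlp.0.weight'] @ x + w['pi.net.mlp.0.bias']   # 8 outputs: mean and spread
    mean = out[:4]                                                # deterministic action = mean
    return np.clip(mean, -1.0, 1.0)                               # clip to the Box(-1, 1) action bounds

# obs would be a length-13 observation from the environment (placeholder):
# action = actor_forward(obs, wghts)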