Custom torch model for PPO with discrete actions

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,
I need help understanding how to structure a custom model for a simple MLP that produces discrete actions. I want to be able to use this model for inference in a plain PyTorch environment (no Ray involved), with the model's __init__() loading the weights from a checkpoint. I am starting from a successful baseline I built for a different project, but that one has a continuous action space and was trained with SAC. I'm stuck adapting the code because I don't understand the layer names and roles that RLlib's PPO algorithm assigns.

Here is the successful SAC/continuous action model:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class BridgitNN(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):

        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)

        self._num_actions = num_outputs
        NUM_FC1_NEURONS = 100
        NUM_FC2_NEURONS = 640
        NUM_FC3_NEURONS = 128
        BRIDGIT_MODEL = "~/chkpt/policies/default_policy/model/model.pt"

        self.fc1 = nn.Linear(ObsVec.BASE_SENSOR_DATA, NUM_FC1_NEURONS)  # ObsVec is project-specific (flattened obs size)
        self.fc2 = nn.Linear(NUM_FC1_NEURONS, NUM_FC2_NEURONS)
        self.fc3 = nn.Linear(NUM_FC2_NEURONS, NUM_FC3_NEURONS)

        self._actor_head = nn.Sequential(
            nn.Linear(NUM_FC3_NEURONS, self._num_actions)
        )

        self._critic_head = nn.Sequential(
            nn.Linear(NUM_FC3_NEURONS, 1)
        )

        # Load the weights for the main Bridgit model's actor network (this model is used for inference only).
        # torch.load() returns the full nn.Module that RLlib saved, so take its state_dict; expanduser() is
        # needed because torch.load() does not expand the "~" in the path.
        sd = torch.load(os.path.expanduser(BRIDGIT_MODEL)).state_dict()
        with torch.no_grad():
            self.fc1.weight.copy_(sd["action_model.fc1.weight"])
            self.fc1.bias.copy_(sd["action_model.fc1.bias"])

            self.fc2.weight.copy_(sd["action_model.fc2.weight"])
            self.fc2.bias.copy_(sd["action_model.fc2.bias"])

            self.fc3.weight.copy_(sd["action_model.fc3.weight"])
            self.fc3.bias.copy_(sd["action_model.fc3.bias"])

            self._actor_head[0].weight.copy_(sd["action_model._actor_head.0.weight"])
            self._actor_head[0].bias.copy_(sd["action_model._actor_head.0.bias"])


    def forward(self, input_dict, state, seq_lens):

        x = input_dict["obs"]

        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.tanh(self.fc3(x))  # torch.tanh; F.tanh is deprecated

        # Final layer for the actor output (action values)
        actions = self._actor_head(x)

        # Final layer for the critic value
        self._value = self._critic_head(x).reshape(-1)

        return actions, state


    def value_function(self):
        return self._value
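
For reference, this is roughly how I use that model for inference outside of Ray. A minimal sketch, assuming a recent Ray that uses gymnasium; the action-space shape and num_outputs here are illustrative stand-ins for my real values:

import gymnasium as gym  # plain gym on older Ray versions
import numpy as np
import torch

# The spaces are only stored by TorchModelV2, so dummy ones suffice for inference.
obs_space = gym.spaces.Box(-1.0, 1.0, shape=(ObsVec.BASE_SENSOR_DATA,), dtype=np.float32)
act_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)  # illustrative

model = BridgitNN(obs_space, act_space, num_outputs=4, model_config={}, name="bridgit")
model.eval()

obs = torch.zeros((1, ObsVec.BASE_SENSOR_DATA))  # one dummy observation
with torch.no_grad():
    # Call forward() directly to bypass RLlib's __call__ plumbing.
    actions, _ = model.forward({"obs": obs}, [], None)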

Now I have started a new project that uses PPO to train a similar network, but with a discrete action space. I trained it directly with the RLlib Algorithm API (not Ray Train), using something like:

algo = cfg.build()
for i in range(MAX_ITERATIONS):
    algo.train()
    if i % CKPT_INT == 0:
        algo.save(checkpoint_dir=PATH)

The model performs well, and now I want to use one of the checkpoints for inference in a PyTorch-only program (i.e. without invoking any RLlib code). Doing so means my model code must understand the checkpoint file's naming convention for the various layers and what each layer is for. This is where I'm completely lost, because I cannot find examples anywhere of how to interpret these names.

I won't show the code for the new model here, as it is virtually identical to the above. However, when I add a print statement after the call to torch.load(), I see the following model structure (tensor contents omitted for brevity):

OrderedDict([('_logits._model.0.weight', tensor(...)),
             ('_logits._model.0.bias', tensor(...)),
             ('_hidden_layers.0._model.0.weight', tensor(...)),
             ('_hidden_layers.0._model.0.bias', tensor(...)),
             ('_hidden_layers.1._model.0.weight', tensor(...)),
             ('_hidden_layers.1._model.0.bias', tensor(...)),
             ('_hidden_layers.2._model.0.weight', tensor(...)),
             ('_hidden_layers.2._model.0.bias', tensor(...)),
             ('_value_branch_separate.0._model.0.weight', tensor(...)),
             ('_value_branch_separate.0._model.0.bias', tensor(...)),
             ('_value_branch_separate.1._model.0.weight', tensor(...)),
             ('_value_branch_separate.1._model.0.bias', tensor(...)),
             ('_value_branch_separate.2._model.0.weight', tensor(...)),
             ('_value_branch_separate.2._model.0.bias', tensor(...)),
             ('_value_branch._model.0.weight', tensor(...)),
             ('_value_branch._model.0.bias', tensor(...))])

It seems obvious that the _hidden_layers are to be used as their name suggests. What I really struggle with is what the actor-head and critic-head code should look like and how they should use the remaining layers. Can anyone help? If you know of a working example out there, I'd love to see it. To make my confusion concrete, my current best guess is sketched below. Thanks!
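
For reference, here is my current guess at a standalone inference module, based on those key names. I'm assuming the checkpoint came from RLlib's default fully connected torch network, where each SlimFC layer wraps its nn.Linear in an nn.Sequential (hence the ._model.0. infix in every key), _logits is the final policy layer emitting one logit per discrete action, and _value_branch_separate plus _value_branch form an independent critic. The sizes below are placeholders that would have to match my observation size and fcnet_hiddens setting, and I'm assuming the default tanh activation:

import os

import torch
import torch.nn as nn

PPO_MODEL = "~/chkpt/policies/default_policy/model/model.pt"

# Placeholder sizes -- these must match the observation size, the
# fcnet_hiddens setting, and the size of the discrete action space.
OBS_SIZE = 50
HIDDEN = [256, 256, 256]
NUM_ACTIONS = 4


class PpoInferenceNet(nn.Module):
    """Best-guess standalone mirror of RLlib's default torch FCNet (policy side only)."""

    def __init__(self):
        super().__init__()

        # Policy trunk: checkpoint keys _hidden_layers.{0,1,2}._model.0.*
        self.hidden0 = nn.Linear(OBS_SIZE, HIDDEN[0])
        self.hidden1 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.hidden2 = nn.Linear(HIDDEN[1], HIDDEN[2])

        # Policy head: checkpoint keys _logits._model.0.* -> one logit per discrete action
        self.logits = nn.Linear(HIDDEN[2], NUM_ACTIONS)

        # The critic (_value_branch_separate.{0,1,2}._model.0.* feeding
        # _value_branch._model.0.*) is only needed for training, so I skip it here.

        sd = torch.load(os.path.expanduser(PPO_MODEL)).state_dict()
        with torch.no_grad():
            self.hidden0.weight.copy_(sd["_hidden_layers.0._model.0.weight"])
            self.hidden0.bias.copy_(sd["_hidden_layers.0._model.0.bias"])
            self.hidden1.weight.copy_(sd["_hidden_layers.1._model.0.weight"])
            self.hidden1.bias.copy_(sd["_hidden_layers.1._model.0.bias"])
            self.hidden2.weight.copy_(sd["_hidden_layers.2._model.0.weight"])
            self.hidden2.bias.copy_(sd["_hidden_layers.2._model.0.bias"])
            self.logits.weight.copy_(sd["_logits._model.0.weight"])
            self.logits.bias.copy_(sd["_logits._model.0.bias"])

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        x = torch.tanh(self.hidden0(obs))   # assuming default fcnet_activation="tanh"
        x = torch.tanh(self.hidden1(x))
        x = torch.tanh(self.hidden2(x))
        return self.logits(x)               # raw logits; argmax for greedy inference

If that mapping is right, greedy action selection would just be torch.argmax over the returned logits. What I can't confirm is whether _logits really is the discrete-action head and whether my activation assumption holds; that's the part I'd love someone to verify.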