Custom torch model for PPO with discrete actions

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,
I need help understanding how to structure a custom model for a simple MLP that produces discrete actions. I want to be able to use this model for inference in a plain PyTorch environment (no Ray involved), with the model's __init__() loading the weights from a checkpoint. I am starting from a successful baseline I built for a different project, but that one has a continuous action space and was trained with SAC. I'm stuck adapting the code because I don't understand the layer names and roles that RLlib's PPO algorithm assigns.

Here is the successful SAC/continuous action model:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class BridgitNN(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):

        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)

        self._num_actions = num_outputs
        NUM_FC1_NEURONS = 100
        NUM_FC2_NEURONS = 640
        NUM_FC3_NEURONS = 128
        BRIDGIT_MODEL = "~/chkpt/policies/default_policy/model/model.pt"

        self.fc1 = nn.Linear(ObsVec.BASE_SENSOR_DATA, NUM_FC1_NEURONS)  # ObsVec is project-specific (flattened obs size)
        self.fc2 = nn.Linear(NUM_FC1_NEURONS, NUM_FC2_NEURONS)
        self.fc3 = nn.Linear(NUM_FC2_NEURONS, NUM_FC3_NEURONS)

        self._actor_head = nn.Sequential(
            nn.Linear(NUM_FC3_NEURONS, self._num_actions)
        )

        self._critic_head = nn.Sequential(
            nn.Linear(NUM_FC3_NEURONS, 1)
        )

        # Load the weights for the main Bridgit model's actor network (this model is used for inference only).
        # torch.load() returns the full nn.Module that RLlib saved, so take its state_dict; expanduser() is
        # needed because torch.load() does not expand the "~" in the path.
        sd = torch.load(os.path.expanduser(BRIDGIT_MODEL)).state_dict()
        with torch.no_grad():
            self.fc1.weight.copy_(sd["action_model.fc1.weight"])
            self.fc1.bias.copy_(sd["action_model.fc1.bias"])

            self.fc2.weight.copy_(sd["action_model.fc2.weight"])
            self.fc2.bias.copy_(sd["action_model.fc2.bias"])

            self.fc3.weight.copy_(sd["action_model.fc3.weight"])
            self.fc3.bias.copy_(sd["action_model.fc3.bias"])

            self._actor_head[0].weight.copy_(sd["action_model._actor_head.0.weight"])
            self._actor_head[0].bias.copy_(sd["action_model._actor_head.0.bias"])


    def forward(self, input_dict, state, seq_lens):

        x = input_dict["obs"]

        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.tanh(self.fc3(x))  # torch.tanh; F.tanh is deprecated

        # Final layer for the actor output (action values)
        actions = self._actor_head(x)

        # Final layer for the critic value
        self._value = self._critic_head(x).reshape(-1)

        return actions, state


    def value_function(self):
        return self._value
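
For reference, this is roughly how I use that model for inference outside of Ray. A minimal sketch, assuming a recent Ray that uses gymnasium; the action-space shape and num_outputs here are illustrative stand-ins for my real values:

import gymnasium as gym  # plain gym on older Ray versions
import numpy as np
import torch

# The spaces are only stored by TorchModelV2, so dummy ones suffice for inference.
obs_space = gym.spaces.Box(-1.0, 1.0, shape=(ObsVec.BASE_SENSOR_DATA,), dtype=np.float32)
act_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)  # illustrative

model = BridgitNN(obs_space, act_space, num_outputs=4, model_config={}, name="bridgit")
model.eval()

obs = torch.zeros((1, ObsVec.BASE_SENSOR_DATA))  # one dummy observation
with torch.no_grad():
    # Call forward() directly to bypass RLlib's __call__ plumbing.
    actions, _ = model.forward({"obs": obs}, [], None)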

Now I have started a new project that uses PPO to train a similar network, but with a discrete action space. I trained it directly with the RLlib Algorithm API (not Ray Train), using something like:

algo = cfg.build()
for i in range(MAX_ITERATIONS):
    algo.train()
    if i % CKPT_INT == 0:
        algo.save(checkpoint_dir=PATH)

The model performs well, and now I want to use one of the checkpoints for inference in a PyTorch-only program (i.e. without invoking any RLlib code). Doing so means my model code must understand the checkpoint file's naming convention for the various layers and what each layer is for. This is where I'm completely lost, because I cannot find examples anywhere of how to interpret these names.

I won't show the code for the new model here, as it is virtually identical to the above. However, when I add a print statement after the call to torch.load(), I see the following model structure (tensor contents omitted for brevity):

OrderedDict([('_logits._model.0.weight', tensor(...)),
             ('_logits._model.0.bias', tensor(...)),
             ('_hidden_layers.0._model.0.weight', tensor(...)),
             ('_hidden_layers.0._model.0.bias', tensor(...)),
             ('_hidden_layers.1._model.0.weight', tensor(...)),
             ('_hidden_layers.1._model.0.bias', tensor(...)),
             ('_hidden_layers.2._model.0.weight', tensor(...)),
             ('_hidden_layers.2._model.0.bias', tensor(...)),
             ('_value_branch_separate.0._model.0.weight', tensor(...)),
             ('_value_branch_separate.0._model.0.bias', tensor(...)),
             ('_value_branch_separate.1._model.0.weight', tensor(...)),
             ('_value_branch_separate.1._model.0.bias', tensor(...)),
             ('_value_branch_separate.2._model.0.weight', tensor(...)),
             ('_value_branch_separate.2._model.0.bias', tensor(...)),
             ('_value_branch._model.0.weight', tensor(...)),
             ('_value_branch._model.0.bias', tensor(...))])

It seems obvious that the _hidden_layers are to be used as their name suggests. What I really struggle with is what the actor-head and critic-head code should look like and how they should use the remaining layers. Can anyone help? If you know of a working example out there, I'd love to see it. To make my confusion concrete, my current best guess is sketched below. Thanks!
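
For reference, here is my current guess at a standalone inference module, based on those key names. I'm assuming the checkpoint came from RLlib's default fully connected torch network, where each SlimFC layer wraps its nn.Linear in an nn.Sequential (hence the ._model.0. infix in every key), _logits is the final policy layer emitting one logit per discrete action, and _value_branch_separate plus _value_branch form an independent critic. The sizes below are placeholders that would have to match my observation size and fcnet_hiddens setting, and I'm assuming the default tanh activation:

import os

import torch
import torch.nn as nn

PPO_MODEL = "~/chkpt/policies/default_policy/model/model.pt"

# Placeholder sizes -- these must match the observation size, the
# fcnet_hiddens setting, and the size of the discrete action space.
OBS_SIZE = 50
HIDDEN = [256, 256, 256]
NUM_ACTIONS = 4


class PpoInferenceNet(nn.Module):
    """Best-guess standalone mirror of RLlib's default torch FCNet (policy side only)."""

    def __init__(self):
        super().__init__()

        # Policy trunk: checkpoint keys _hidden_layers.{0,1,2}._model.0.*
        self.hidden0 = nn.Linear(OBS_SIZE, HIDDEN[0])
        self.hidden1 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.hidden2 = nn.Linear(HIDDEN[1], HIDDEN[2])

        # Policy head: checkpoint keys _logits._model.0.* -> one logit per discrete action
        self.logits = nn.Linear(HIDDEN[2], NUM_ACTIONS)

        # The critic (_value_branch_separate.{0,1,2}._model.0.* feeding
        # _value_branch._model.0.*) is only needed for training, so I skip it here.

        sd = torch.load(os.path.expanduser(PPO_MODEL)).state_dict()
        with torch.no_grad():
            self.hidden0.weight.copy_(sd["_hidden_layers.0._model.0.weight"])
            self.hidden0.bias.copy_(sd["_hidden_layers.0._model.0.bias"])
            self.hidden1.weight.copy_(sd["_hidden_layers.1._model.0.weight"])
            self.hidden1.bias.copy_(sd["_hidden_layers.1._model.0.bias"])
            self.hidden2.weight.copy_(sd["_hidden_layers.2._model.0.weight"])
            self.hidden2.bias.copy_(sd["_hidden_layers.2._model.0.bias"])
            self.logits.weight.copy_(sd["_logits._model.0.weight"])
            self.logits.bias.copy_(sd["_logits._model.0.bias"])

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        x = torch.tanh(self.hidden0(obs))   # assuming default fcnet_activation="tanh"
        x = torch.tanh(self.hidden1(x))
        x = torch.tanh(self.hidden2(x))
        return self.logits(x)               # raw logits; argmax for greedy inference

If that mapping is right, greedy action selection would just be torch.argmax over the returned logits. What I can't confirm is whether _logits really is the discrete-action head and whether my activation assumption holds; that's the part I'd love someone to verify.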