Continuous action space and custom model

Hi guys,
First of all many thanks to this wonderful community that has already helped me a lot!
I’m solving an optimisation problem with RLlib (so far, I’m very optimistic). The particular details are quite complex, but it boils down to ‘shooting’ targets in 2D space:

The observation space is an NxN grid with ‘targets’. When ‘shot’, each target disappears; this logic is implemented in the custom environment.
After some success solving this with a discrete action space and PPO (using each grid square as a separate action, NxN in total), I’ve hit a wall as the observation space grows and, in turn, the action space grows with it.
The obvious idea here was to try a continuous action space; moreover, the problem looks continuous ‘by nature’. Therefore I implemented the continuous space like

gym.spaces.Box(low=0, high=1, shape=(2,))

for the ‘x’ and ‘y’ dimensions of the grid.
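For illustration, here is a minimal sketch of how such a two-dimensional action could be mapped back to a grid square inside the environment. The function name action_to_cell and the clamp-then-floor rule are assumptions made for this example, not something taken from the actual environment.

```python
def action_to_cell(action, n):
    """Map a continuous action (x, y) in [0, 1]^2 to a (row, col) cell of an n x n grid.

    Hypothetical helper: the real environment may use a different convention.
    """
    x, y = action
    # Clamp first: a sampled Gaussian action can fall slightly outside the Box bounds.
    x = min(max(x, 0.0), 1.0)
    y = min(max(y, 0.0), 1.0)
    # Scale to [0, n] and truncate; min() keeps a coordinate of exactly 1.0 in the last cell.
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return row, col
```

With this convention the action always has two components, however large N becomes, which is the motivation for moving away from NxN discrete actions.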

For some reason, the RLlib documentation seems to lack examples of continuous action spaces used with custom models.
From my understanding of the theory, I need my model to output 2 means and 2 stds for a two-dimensional Gaussian distribution. Can you please point me to some examples of how RLlib expects this output to be formatted? As I understand it, the action distribution is chosen based on the action space; however, I can’t find the module in the RLlib code that samples actions from my model’s outputs based on the action space of my environment. :exploding_head:
Basically, what I’m asking is if my model forward pass outputs something like

def forward(self, input_dict, state, seq_lens):
   return outputs, state

How do I decide on the outputs.shape given the action space I have above?
Will be grateful for any hints!

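The outputs should have shape [BATCH, num_outputs], as described in the ModelV2 API. In your case num_outputs will be 4: as you already said, 2 for the means followed by 2 for the spread (in that order). Note that the default diagonal Gaussian treats the second half as log standard deviations and exponentiates them, so the model does not have to produce positive values there. The code for the default Gaussian action distribution can be found at ActionDist. Maybe I misunderstood the question.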
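To make the layout concrete, here is a minimal pure-Python sketch of how one row of a [BATCH, 4] output could be interpreted. It assumes the first half holds the means and the second half the log standard deviations, as in the default diagonal Gaussian; the helper names and the final clipping to the Box bounds are assumptions for illustration and are not RLlib functions.

```python
import math
import random

def split_gaussian_params(model_out_row):
    """Split one output row into (means, log_stds): first half, then second half."""
    half = len(model_out_row) // 2
    return model_out_row[:half], model_out_row[half:]

def sample_action(model_out_row, rng, low=0.0, high=1.0):
    """Sample each action dimension independently, then clip to [low, high]."""
    means, log_stds = split_gaussian_params(model_out_row)
    action = []
    for mean, log_std in zip(means, log_stds):
        std = math.exp(log_std)  # exponentiating keeps the std positive
        raw = rng.gauss(mean, std)
        action.append(min(max(raw, low), high))  # clipping is assumed here
    return action

# One row of model output for Box(shape=(2,)): [mean_x, mean_y, log_std_x, log_std_y]
row = [0.3, 0.7, -2.0, -2.0]
print(sample_action(row, random.Random(0)))
```

In RLlib itself this splitting and sampling happens inside the action distribution class, so the model only has to return the raw [BATCH, num_outputs] tensor.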

Hi @Vladimir_Uspenskii,

RLlib will compute num_outputs for you automatically. This should work fine unless you have a custom gym space or an action distribution that does not follow the standard ones.

The ModelV2 __init__ method has an argument called num_outputs that tells you how many outputs the forward method should produce.

You can see the arguments here:

Hi @Sertingolix,
thanks! Yes, torch_action_dist was exactly what I was looking for; now I see how the outputs of the model are handled. One more stupid question: if I were to switch to TorchSquashedGaussian or TorchBeta, how would I do that?
@mannyv, thanks! Yes, I’ve noticed the num_outputs param, though I had trouble understanding how the outputs are treated afterwards, i.e. which output goes where. It seems to work, so now I have a chance to find out whether the continuous space works better after all.
Thank you for your help!

Hi @Vladimir_Uspenskii,

You can follow the example here. That example defines an entirely new action distribution, but you do not need to do that: you can just specify one of the predefined ones to override the default one for the policy.
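As a rough sketch of what that could look like with the ModelV2 API stack: register one of the predefined classes under a name, then reference that name in the model config. The names "squashed_gaussian" and "my_model" are placeholders invented for this example, and the squashing bounds and action-normalisation settings should be checked against the installed RLlib version.

```python
def register_squashed_gaussian(name="squashed_gaussian"):
    """Register RLlib's predefined TorchSquashedGaussian under a custom name.

    Imports live inside the function so the config dict below can be
    inspected even where Ray is not installed.
    """
    from ray.rllib.models import ModelCatalog
    from ray.rllib.models.torch.torch_action_dist import TorchSquashedGaussian

    ModelCatalog.register_custom_action_dist(name, TorchSquashedGaussian)
    return name

# Call register_squashed_gaussian() once before building the trainer.
config = {
    "framework": "torch",
    "model": {
        "custom_model": "my_model",  # placeholder: your registered custom model
        "custom_action_dist": "squashed_gaussian",  # must match the registered name
    },
}
```

TorchBeta could be registered the same way by swapping the imported class.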