Continuous action space and custom model

Vladimir_Uspenskii · July 15, 2021, 6:20pm

Hi guys,
First of all many thanks to this wonderful community that has already helped me a lot!
I’m solving an optimisation problem with RLlib (so far, I’m very optimistic). The particular details are quite complex, but they boil down to ‘shooting’ targets in 2D space:

The observation space is an NxN grid with ‘targets’. When ‘shot’, each target disappears; this logic is implemented in the custom environment.
After some success solving this with a discrete action space and PPO (using each grid square as a separate action, NxN in total), I’ve hit a wall: as the observation space grows, the action space grows with it.
The obvious idea was to try a continuous action space; moreover, the problem looks continuous ‘by nature’. So I implemented the continuous space like

gym.spaces.Box(low=0, high=1, shape=(2,))

for the ‘x’ and ‘y’ dimensions of the grid.

It looks like the RLlib documentation lacks examples of continuous action spaces used with custom models.
From my understanding of the theory, I need my model to output 2 means and 2 std’s for a two-dimensional Gaussian distribution. Could you please point me to some examples of how RLlib expects this output to be formatted? As I understand it, the action distribution is chosen based on the action space; however, I can’t find the module in the RLlib code that samples actions from my model’s outputs based on the action space of my environment.
Basically, what I’m asking is: if my model’s forward pass outputs something like

def forward(self, input_dict, state, seq_lens):
    ...
    return outputs, state

How do I decide on the outputs.shape given the action space I have above?
Will be grateful for any hints!

Sertingolix · July 16, 2021, 8:48am

The outputs should be of size [BATCH, num_outputs], as described in the ModelV2 API. In your case num_outputs will be 4: as you already said, 2 for the mean and 2 for the std (in that order). The code for the default Gaussian action distribution can be found at ActionDist. Maybe I misunderstood the question.
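
To make the layout concrete, here is a minimal pure-Python sketch of how one row of a [BATCH, 4] model output could be turned into a 2-D action. It assumes the first half of the row holds the means and the second half holds log standard deviations (as far as I can tell, this is how RLlib's diagonal Gaussian splits its inputs, so the "std" outputs are exponentiated before use). The helper names split_diag_gaussian and sample_action are hypothetical and not part of RLlib.

```python
import math
import random


def split_diag_gaussian(model_out_row):
    """Split one row of model output into (means, stds).

    Assumed layout: [mean_0, ..., mean_{k-1}, log_std_0, ..., log_std_{k-1}].
    """
    half = len(model_out_row) // 2
    means = model_out_row[:half]
    log_stds = model_out_row[half:]
    # Exponentiating guarantees a strictly positive standard deviation.
    stds = [math.exp(s) for s in log_stds]
    return means, stds


def sample_action(model_out_row, rng):
    """Draw one action with an independent Gaussian per action dimension."""
    means, stds = split_diag_gaussian(model_out_row)
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


# One row of a [BATCH, 4] output for the Box(low=0, high=1, shape=(2,)) space.
row = [0.25, 0.75, math.log(0.1), math.log(0.2)]
means, stds = split_diag_gaussian(row)
action = sample_action(row, random.Random(0))  # [x, y], not yet clipped
```

Note that a plain Gaussian is unbounded, so a sampled (x, y) can fall outside [0, 1]; whether it gets clipped or squashed depends on the policy configuration and the action distribution in use.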

mannyv · July 16, 2021, 12:14pm

Hi @Vladimir_Uspenskii,

RLlib will compute num_outputs for you automatically. This should work fine unless you have a custom gym space or action distribution that does not follow the standard ones.

The ModelV2 __init__ method has an argument called num_outputs that tells you how many outputs the forward method should produce.

You can see the arguments here:

https://github.com/ray-project/ray/blob/21b464ae9dcae819f3d51da0bd626b30cc34c3b4/rllib/models/modelv2.py#L36

    """Defines an abstract neural network model for use with RLlib.

    Custom models should extend either TFModelV2 or TorchModelV2 instead of
    this class directly.

    Data flow:
        obs -> forward() -> model_out
               value_function() -> V(s)
    """

    def __init__(self, obs_space: gym.spaces.Space,
                 action_space: gym.spaces.Space, num_outputs: int,
                 model_config: ModelConfigDict, name: str, framework: str):
        """Initializes a ModelV2 object.

        This method should create any variables used by the model.

        Args:
            obs_space (gym.spaces.Space): Observation space of the target gym
                env. This may have an `original_space` attribute that
                specifies how to unflatten the tensor into a ragged tensor.
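
As a rough illustration of where the num_outputs value comes from, the sketch below compares the discrete and continuous formulations from the original question. It assumes a diagonal Gaussian needs one mean and one (log) std per action dimension, and a discrete policy needs one logit per grid square. Both helper functions are hypothetical and only mirror the arithmetic; RLlib derives the real value itself and passes it to __init__.

```python
from math import prod


def diag_gaussian_num_outputs(action_shape):
    """Outputs for a Box action space: a mean and a (log) std per dimension."""
    return 2 * prod(action_shape)


def discrete_grid_num_outputs(n):
    """Outputs for the discrete formulation: one logit per grid square."""
    return n * n


# Box(low=0, high=1, shape=(2,)) from the question.
continuous_outputs = diag_gaussian_num_outputs((2,))

# The discrete head grows quadratically with the grid size,
# while the continuous head stays at the same width.
discrete_outputs = {n: discrete_grid_num_outputs(n) for n in (8, 32, 128)}
```

In a custom model it is usually safer to size the final layer from the num_outputs argument than to hard-code a number like 4, so the model keeps working if the action space or distribution changes.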

Vladimir_Uspenskii · July 16, 2021, 6:33pm

Hi @Sertingolix,
thanks! Yes, torch_action_dist was exactly what I was looking for; now I see how the outputs of the model are handled. One more stupid question: if I wanted to switch to TorchSquashedGaussian or TorchBeta, how would I do that?
@mannyv, thanks! Yes, I’ve noticed the num_outputs param, though I had trouble understanding how the outputs would be treated afterwards, i.e. which output goes where. It seems to work now, so I have a chance to decide whether the continuous space works better after all.
Thank you for your help!

mannyv · July 17, 2021, 11:12am

Hi @Vladimir_Uspenskii,

You can follow the example here. It defines an entirely new action distribution, but you do not need to do that; you can just specify one of the predefined ones to override the policy's default.

https://docs.ray.io/en/master/rllib-models.html#custom-action-distributions
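
A minimal sketch of what such a config could look like, assuming the model-config keys "custom_model" and "custom_action_dist" described on the linked docs page. The names "my_model" and "squashed_gaussian" are placeholders that would first have to be registered with ModelCatalog; the registration calls are shown only as comments so the fragment stays runnable without Ray installed.

```python
# Registration step (commented out; requires Ray/RLlib and a model class):
#   from ray.rllib.models import ModelCatalog
#   from ray.rllib.models.torch.torch_action_dist import TorchSquashedGaussian
#   ModelCatalog.register_custom_model("my_model", MyTorchModel)
#   ModelCatalog.register_custom_action_dist(
#       "squashed_gaussian", TorchSquashedGaussian)

config = {
    "framework": "torch",
    "model": {
        # Placeholder name of the registered custom model.
        "custom_model": "my_model",
        # Placeholder name of the registered action distribution;
        # this overrides the default chosen from the action space.
        "custom_action_dist": "squashed_gaussian",
    },
}
```

Before switching, it is worth checking the chosen distribution's expected number of inputs and its output bounds against the Box(0, 1) action space, since those may differ from the default Gaussian.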
