Why doesn't the policy model respect obs_space and action_space?

I’m trying to understand RLlib policies. It seems that their models do not actually respect the observation and action spaces:

Here is what I’m working with:

(Pdb) type(policy)
<class 'ray.rllib.policy.torch_policy_template.PPOTorchPolicy'>
(Pdb) type(policy.model)
<class 'ray.rllib.models.torch.fcnet.FullyConnectedNetwork'>
(Pdb) policy.model.obs_space
Box(-5000.0, 5000.0, (260,), float32)
(Pdb) policy.model.action_space
Discrete(4)
(Pdb) obs = torch.as_tensor(policy.model.obs_space.sample())
(Pdb) obs.size()
torch.Size([260])

I expect the model.forward function to accept this observation, since it is a direct sample of the model's obs_space, and to return a tensor of shape torch.Size([4]). However, when I run it I get this:

(Pdb) actions = policy.model.forward({"obs_flat":obs, "obs":obs}, None, None)
*** RuntimeError: mat1 and mat2 shapes cannot be multiplied (260x1 and 260x256)

And if I pass it the observation that it seems to actually expect, the returned action logits are not of the shape I expect:

(Pdb) new_obs = obs.repeat(256, 1)
(Pdb) new_obs.size()
torch.Size([256, 260])
(Pdb) actions = policy.model.forward({"obs_flat":new_obs, "obs":new_obs}, None, None)
(Pdb) actions
(tensor([[-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        ...,
        [-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558]], grad_fn=<AddmmBackward>), None)
(Pdb) actions[0].size()
torch.Size([256, 4])

What I expect is a tensor of torch.Size([4]) like this:

(Pdb) mean_actions = torch.mean(actions[0], 0)
(Pdb) mean_actions
tensor([-0.3594,  0.5544, -1.1114,  1.4558], grad_fn=<MeanBackward1>)

What am I misunderstanding?

Hey @Eric_Adlam, thanks for posting this question. I think you got confused by the batch dimension.
The spaces stored in the policy’s and model’s obs_space/action_space properties are always the non-batched spaces, for example Box(-1.0, 1.0, (5,)) for a 5-dimensional observation without(!) the batch dimension. In your example, you are handling this correctly by adding a batch dimension (256):

Taking your example:

(Pdb) new_obs = obs.repeat(256, 1)
(Pdb) new_obs.size()
torch.Size([256, 260])
(Pdb) actions = policy.model.forward({"obs_flat":new_obs, "obs":new_obs}, None, None)
(Pdb) actions
(tensor([[-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        ...,
        [-0.3594,  0.5544, -1.1114,  1.4558]], grad_fn=<AddmmBackward>), None)
(Pdb) actions[0].size()
torch.Size([256, 4])

The model correctly outputs 256 sets of the 4 logits for your action distribution to sample from.
In other words, you pass in 256 different observations and, for each one, get the action logits back. This is the correct behavior, as you need one action per observation.
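If you just want the logits for a single observation, a minimal sketch (reusing the policy object from the session above) is to add a batch dimension of size 1 before the forward pass and squeeze it away afterwards:

import torch

obs = torch.as_tensor(policy.model.obs_space.sample())   # shape [260], no batch dimension
obs_batch = obs.unsqueeze(0)                              # shape [1, 260], a batch of one
logits, _ = policy.model.forward({"obs": obs_batch, "obs_flat": obs_batch}, None, None)
# logits has shape [1, 4]; drop the batch dimension to get the [4] tensor you expected
single_logits = logits.squeeze(0)                         # shape [4]

In practice you usually don’t call model.forward yourself; depending on your RLlib version, policy.compute_actions() (which takes a batch of observations, even a batch of one) handles the batching, preprocessing, and action sampling for you.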
