Why doesn't the policy model respect obs_space and action_space?

I’m trying to understand RLlib policies. It seems that their models do not actually respect the observation and action spaces:

Here is what I’m working with:

(Pdb) type(policy)
<class 'ray.rllib.policy.torch_policy_template.PPOTorchPolicy'>
(Pdb) type(policy.model)
<class 'ray.rllib.models.torch.fcnet.FullyConnectedNetwork'>
(Pdb) policy.model.obs_space
Box(-5000.0, 5000.0, (260,), float32)
(Pdb) policy.model.action_space
Discrete(4)
(Pdb) obs = torch.as_tensor(policy.model.obs_space.sample())
(Pdb) obs.size()
torch.Size([260])

I expect the model.forward function to accept this observation, since it is a direct sample of the model's obs_space, and to return a tensor of shape torch.Size([4]). However, when I run it I get this:

(Pdb) actions = policy.model.forward({"obs_flat":obs, "obs":obs}, None, None)
*** RuntimeError: mat1 and mat2 shapes cannot be multiplied (260x1 and 260x256)

And if I pass it the observation that it seems to actually expect, the returned action logits are not of the shape I expect:

(Pdb) new_obs = obs.repeat(256, 1)
(Pdb) new_obs.size()
torch.Size([256, 260])
(Pdb) actions = policy.model.forward({"obs_flat":new_obs, "obs":new_obs}, None, None)
(Pdb) actions
(tensor([[-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        ...,
        [-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558]], grad_fn=<AddmmBackward>), None)
(Pdb) actions[0].size()
torch.Size([256, 4])

What I expect is a tensor of torch.Size([4]) like this:

(Pdb) mean_actions = torch.mean(actions[0], 0)
(Pdb) mean_actions
tensor([-0.3594,  0.5544, -1.1114,  1.4558], grad_fn=<MeanBackward1>)

What am I misunderstanding?

Hey @Eric_Adlam, thanks for posting this question. I think you got confused by the batch dimension.
The spaces stored in the policy’s and model’s obs_space/action_space properties are always the non-batched spaces, for example Box(-1.0, 1.0, (5,)) for a 5-dimensional observation without(!) the batch dimension. In your example, you are handling this correctly by adding a batch dimension (256):

Taking your example:

(Pdb) new_obs = obs.repeat(256, 1)
(Pdb) new_obs.size()
torch.Size([256, 260])
(Pdb) actions = policy.model.forward({"obs_flat":new_obs, "obs":new_obs}, None, None)
(Pdb) actions
(tensor([[-0.3594,  0.5544, -1.1114,  1.4558],
        [-0.3594,  0.5544, -1.1114,  1.4558],
        ...,
        [-0.3594,  0.5544, -1.1114,  1.4558]], grad_fn=<AddmmBackward>), None)
(Pdb) actions[0].size()
torch.Size([256, 4])

The model correctly outputs 256 sets of the 4 logits for your action distribution to sample from.
In other words, you pass in 256 different observations and, for each one, get the action logits back. This is the correct behavior, as you need one action per observation.
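If you just want the logits for a single observation, a minimal sketch (reusing the policy object from the session above) is to add a batch dimension of size 1 before the forward pass and squeeze it away afterwards:

import torch

obs = torch.as_tensor(policy.model.obs_space.sample())   # shape [260], no batch dimension
obs_batch = obs.unsqueeze(0)                              # shape [1, 260], a batch of one
logits, _ = policy.model.forward({"obs": obs_batch, "obs_flat": obs_batch}, None, None)
# logits has shape [1, 4]; drop the batch dimension to get the [4] tensor you expected
single_logits = logits.squeeze(0)                         # shape [4]

In practice you usually don’t call model.forward yourself; depending on your RLlib version, policy.compute_actions() (which takes a batch of observations, even a batch of one) handles the batching, preprocessing, and action sampling for you.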
