I’m trying to understand RLlib policies. It seems that their models do not actually respect the observation and action spaces.
Here is what I’m working with:
(Pdb) type(policy)
<class 'ray.rllib.policy.torch_policy_template.PPOTorchPolicy'>
(Pdb) type(policy.model)
<class 'ray.rllib.models.torch.fcnet.FullyConnectedNetwork'>
(Pdb) policy.model.obs_space
Box(-5000.0, 5000.0, (260,), float32)
(Pdb) policy.model.action_space
Discrete(4)
(Pdb) obs = torch.as_tensor(policy.model.obs_space.sample())
(Pdb) obs.size()
torch.Size([260])
I expected model.forward to accept this observation, since it is a direct sample of the model's obs_space, and to return a tensor of shape torch.Size([4]). However, when I run it I get this:
(Pdb) actions = policy.model.forward({"obs_flat":obs, "obs":obs}, None, None)
*** RuntimeError: mat1 and mat2 shapes cannot be multiplied (260x1 and 260x256)
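The error shapes suggest the 1-D observation is being interpreted as a batch of 260 samples with one feature each. The exact same message can be reproduced with a plain PyTorch linear layer matching the FCNet's first hidden layer (a self-contained sketch of my guess at what happens internally, not actual RLlib code):

```python
import torch

layer = torch.nn.Linear(260, 256)  # same shape as the FCNet's first hidden layer
obs = torch.rand(260)

# Treating the 1-D tensor as a batch of 260 one-feature samples reproduces
# the error: mat1 is [260, 1], mat2 (the transposed weight) is [260, 256].
try:
    layer(obs.unsqueeze(1))  # shape [260, 1]
except RuntimeError as e:
    print(e)  # mat1 and mat2 shapes cannot be multiplied (260x1 and 260x256)
```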
And if I pass in an observation of the shape it actually seems to expect, the returned action logits are not the shape I want:
(Pdb) new_obs = obs.repeat(256, 1)
(Pdb) new_obs.size()
torch.Size([256, 260])
(Pdb) actions = policy.model.forward({"obs_flat":new_obs, "obs":new_obs}, None, None)
(Pdb) actions
(tensor([[-0.3594, 0.5544, -1.1114, 1.4558],
[-0.3594, 0.5544, -1.1114, 1.4558],
[-0.3594, 0.5544, -1.1114, 1.4558],
...,
[-0.3594, 0.5544, -1.1114, 1.4558],
[-0.3594, 0.5544, -1.1114, 1.4558],
[-0.3594, 0.5544, -1.1114, 1.4558]], grad_fn=<AddmmBackward>), None)
(Pdb) actions[0].size()
torch.Size([256, 4])
What I expect is a tensor of torch.Size([4]) like this:
(Pdb) mean_actions = torch.mean(actions[0], 0)
(Pdb) mean_actions
tensor([-0.3594, 0.5544, -1.1114, 1.4558], grad_fn=<MeanBackward1>)
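In other words, what I was hoping for is behavior like this plain-PyTorch stand-in for the policy network (hypothetical layer sizes chosen to match the FCNet defaults above), where a single observation maps to a single logit vector:

```python
import torch

# Plain-PyTorch stand-in for the policy network: 260-dim obs -> 4 action logits.
net = torch.nn.Sequential(
    torch.nn.Linear(260, 256),
    torch.nn.Tanh(),
    torch.nn.Linear(256, 4),
)

obs = torch.rand(260)           # one unbatched observation, shape [260]

logits = net(obs.unsqueeze(0))  # add a batch dimension: [1, 260] -> [1, 4]
single = logits.squeeze(0)      # drop the batch dimension: shape [4]
print(single.shape)             # torch.Size([4])
```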
What am I misunderstanding?