RNN L2 weights regularization

Hi @mg64ve,

You need to reshape the prev_actions to a 2D tensor before passing them to torch_one_hot:

Step 1: Reshape to [batch_size*timesteps, -1]
Step 2: Call torch_one_hot
Step 3: Reshape to [batch_size, -1]

I cannot create example code today, but if this does not make sense or you get stuck, I am happy to provide some tomorrow.

Thanks @Sertingolix and @mannyv for your comments.
However, here we do not have any information regarding the batch size.
Look at how the observations and rewards are reshaped:

        obs = torch.reshape(obs,
                            [-1, self.obs_space.shape[0] * self.num_frames])
        rewards = torch.reshape(input_dict["prev_n_rewards"],
                                [-1, self.num_frames])

self.num_frames is 16 and it is the number of timesteps.
Since afterwards the code does:

input_ = torch.cat([obs, actions, rewards], dim=-1)

Putting everything together in one tensor like this, I believe we need to keep the same format.
By batch_size, do you mean the seq_lens?
I can see it here:

    def forward(self, input_dict, states, seq_lens):

I use self._last_batch_size = input_dict["obs"].shape[0] for the last batch size.
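
To make explicit what the cat requires (these shapes are purely illustrative, not my real ones): torch.cat along dim=-1 only works if every tensor shares the leading batch dimension.

    # illustrative shapes only: cat along dim=-1 needs a common batch dimension
    import torch

    obs     = torch.zeros(32, 4 * 16)   # [batch, obs_dim * num_frames]
    rewards = torch.zeros(32, 16)       # [batch, num_frames]
    actions = torch.zeros(32, 10)       # any width, but the same batch of 32
    print(torch.cat([obs, actions, rewards], dim=-1).shape)  # torch.Size([32, 90])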

@mg64ve,

Sure we do.

prev_actions = input_dict["prev_n_actions"]
pa_batch_size = prev_actions.shape[0]
pa_timesteps = prev_actions.shape[1]

Thanks @Sertingolix.
I have tried, but the following:

def forward(self, input_dict, states, seq_lens):
    obs = input_dict["prev_n_obs"]
    self._last_batch_size = input_dict["obs"].shape[0]
    obs = torch.reshape(obs,
                        [-1, self.obs_space.shape[0] * self.num_frames])
    rewards = torch.reshape(input_dict["prev_n_rewards"],
                            [-1, self.num_frames])
    print('{}'.format(input_dict["prev_n_actions"].shape))
    actions = torch.reshape(input_dict["prev_n_actions"],
                            [-1, self.num_frames * self._last_batch_size])

    actions = torch_one_hot(actions,
                            self.action_space)

    actions = torch.reshape(actions,
                            [-1, self._last_batch_size])

    input_ = torch.cat([obs, actions, rewards], dim=-1)
    features = self.layer1(input_)
    features = self.layer2(features)
    out = self.out(features)
    self._last_value = self.values(features)
    return out, []                            

fails with the following error:

(pid=23507)     actions = torch.reshape(actions,
(pid=23507) RuntimeError: shape '[-1, 32]' is invalid for input of size 8

and the following:

def forward(self, input_dict, states, seq_lens):
    obs = input_dict["prev_n_obs"]
    self._last_batch_size = input_dict["obs"].shape[0]
    obs = torch.reshape(obs,
                        [-1, self.obs_space.shape[0] * self.num_frames])
    rewards = torch.reshape(input_dict["prev_n_rewards"],
                            [-1, self.num_frames])
    print('{}'.format(input_dict["prev_n_actions"].shape))
    actions = torch.reshape(input_dict["prev_n_actions"],
                            [-1, self._last_batch_size])
    actions = torch_one_hot(actions,
                            self.action_space)
    actions = torch.reshape(actions,
                            [-1, self._last_batch_size])

    input_ = torch.cat([obs, actions, rewards], dim=-1)
    features = self.layer1(input_)
    features = self.layer2(features)
    out = self.out(features)
    self._last_value = self.values(features)
    return out, []                            

it fails with the following error:

(pid=23722)     input_ = torch.cat([obs, actions, rewards], dim=-1)
(pid=23722) RuntimeError: torch.cat(): Sizes of tensors must match except in dimension 1. Got 32 and 4 in dimension 0 (The offending index is 1)

Thanks @mannyv, but shouldn’t it be:

pa_timesteps = self.num_frames

?

@mannyv unfortunately the following also fails:

def forward(self, input_dict, states, seq_lens):
    obs = input_dict["prev_n_obs"]
    obs = torch.reshape(obs,
                        [-1, self.obs_space.shape[0] * self.num_frames])
    rewards = torch.reshape(input_dict["prev_n_rewards"],
                            [-1, self.num_frames])

    actions = input_dict["prev_n_actions"]
    pa_batch_size = actions.shape[0]
    pa_timesteps = actions.shape[1]

    actions = torch.reshape(actions,
                            [-1, pa_batch_size * pa_timesteps])

    actions = torch_one_hot(actions,
                            self.action_space)

    actions = torch.reshape(actions,
                            [-1, pa_batch_size])

    input_ = torch.cat([obs, actions, rewards], dim=-1)
    features = self.layer1(input_)
    features = self.layer2(features)
    out = self.out(features)
    self._last_value = self.values(features)
    return out, []

with the following error:

(pid=24151)     actions = torch.reshape(actions,
(pid=24151) RuntimeError: shape '[-1, 32]' is invalid for input of size 8

Instead of -1 you should have the batch dimension. Your proposal could work, but it’s usually a bit dangerous to mix samples along the batch dimension.

    pa_batch_size = actions.shape[0]
    pa_timesteps = actions.shape[1]
    pa_actions = actions.shape[2]

    actions = torch.reshape(actions,
                            [pa_batch_size , pa_timesteps*pa_actions])

    actions = torch_one_hot(actions,
                            self.action_space)

    actions = torch.reshape(actions,
                            [pa_batch_size,-1])

Hi @Sertingolix,

actions = torch.reshape(actions, [pa_batch_size , pa_timesteps*pa_actions])

I don’t think this will work with the way torch_one_hot is written. It expects every row to be one MultiDiscrete action, but the code above makes every row 16 MultiDiscrete actions. It is going to ignore a lot of actions.
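
Roughly what I mean, as a toy sketch (this just mimics a MultiDiscrete one-hot with plain torch, it is not the RLlib helper): each row has to hold one entry per sub-action, and the one-hot output has one block per entry of nvec.

    # toy sketch: mimic a MultiDiscrete([3, 5]) one-hot with plain torch
    import torch
    import torch.nn.functional as F

    nvec = [3, 5]
    rows = torch.tensor([[2, 4], [0, 1]])   # one MultiDiscrete action per row: shape [2, 2]
    encoded = torch.cat(
        [F.one_hot(rows[:, i], n) for i, n in enumerate(nvec)], dim=-1)
    print(encoded.shape)                    # torch.Size([2, 8]), i.e. [rows, sum(nvec)]
    # a [batch, 16 * 2] layout (16 actions per row) would no longer line up with nvec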

Why do you say it is dangerous to mix samples along the batch dimension?

The version @mg64ve needs is
actions = torch.reshape(actions, [pa_batch_size*pa_timesteps, pa_actions])

You are absolutely right. The nvec dimension would not add up.

Use @mannyv’s proposal.

Sometimes it works, but other times it fails silently, and that is a pain to debug. In this case it should be OK, as we directly reshape it again.
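
A toy example of the silent part: nothing complains, but values from different samples end up in the same row.

    # toy example: reshape with -1 in the wrong place mixes samples without any error
    import torch

    x = torch.arange(12).reshape(2, 6)   # 2 samples, 6 values each
    y = torch.reshape(x, [-1, 4])        # no error, shape [3, 4]
    print(y)
    # tensor([[ 0,  1,  2,  3],
    #         [ 4,  5,  6,  7],   <- values from sample 0 and sample 1 in one row
    #         [ 8,  9, 10, 11]])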

@mannyv what do you mean by pa_actions?

It is based on this definition: pa_actions = actions.shape[2].


Thanks @Sertingolix.
I am now using the following code:

def forward(self, input_dict, states, seq_lens):
    obs = input_dict["prev_n_obs"]
    obs = torch.reshape(obs,
                        [-1, self.obs_space.shape[0] * self.num_frames])
    rewards = torch.reshape(input_dict["prev_n_rewards"],
                            [-1, self.num_frames])

    actions = input_dict["prev_n_actions"]
    pa_batch_size = actions.shape[0]
    pa_timesteps = actions.shape[1]
    pa_actions = actions.shape[2]

    actions = torch.reshape(actions,
                            [pa_batch_size * pa_timesteps, pa_actions])

    actions = torch_one_hot(actions,
                            self.action_space)

    actions = torch.reshape(actions,
                            [-1, pa_batch_size])

    print('{}, {}, {}', actions.shape, obs.shape, rewards.shape)

    input_ = torch.cat([obs, actions, rewards], dim=-1)
    features = self.layer1(input_)
    features = self.layer2(features)
    out = self.out(features)
    self._last_value = self.values(features)
    return out, []

It prints the following:

(pid=24616) {}, {}, {} torch.Size([64, 32]) torch.Size([32, 32]) torch.Size([32, 16])

but it now fails in the torch.cat:

(pid=24617) RuntimeError: torch.cat(): Sizes of tensors must match except in dimension 1. Got 32 and 64 in dimension 0 (The offending index is 1)

@mg64ve

You need to put the batch size first, not second.
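
i.e. the last reshape should look something like:

    actions = torch.reshape(actions, [pa_batch_size, -1])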

@Sertingolix I see now what you mean about it being dangerous.

Thanks @mannyv, now it works.
But what about the rollout part?

    env = StatelessCartPole()

    # run until episode ends
    for _ in range(10):
        episode_reward = 0
        reward = 0.
        action = 0
        done = False
        obs = env.reset()
        state=np.zeros(2*256, np.float32).reshape(2,256)
        # state=None
        while not done:
            action, state, logits = agent.compute_action(obs, state)
            obs, reward, done, info = env.step(action)
            episode_reward += reward

        print("reward: {}".format(episode_reward))

this fails with the following error:

Traceback (most recent call last):
  File "trajectory_view_api.py", line 113, in <module>
    action, state, logits = agent.compute_action(obs, state)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 952, in compute_action
    result = self.get_policy(policy_id).compute_single_action(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 214, in compute_single_action
    out = self.compute_actions(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 238, in compute_actions
    return self._compute_action_helper(input_dict, state_batches,
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 325, in _compute_action_helper
    dist_inputs, state_out = self.model(input_dict, state_batches,
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/models/modelv2.py", line 234, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "/srv/docker/ray/examples/attentionMD/trajectory_view_utilizing_models.py", line 57, in forward
    obs = input_dict["prev_n_obs"]
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 492, in __getitem__
    value = dict.__getitem__(self, key)
KeyError: 'prev_n_obs'
Exception ignored in: <function ActorHandle.__del__ at 0x7f982d5eb0d0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ray/actor.py", line 823, in __del__
AttributeError: 'NoneType' object has no attribute 'global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x7f982d5eb0d0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ray/actor.py", line 823, in __del__
AttributeError: 'NoneType' object has no attribute 'global_worker'

Is the error in how I build the initial state?
What should I do?

Take a look here: RLlib Package Reference — Ray v2.0.0.dev0, and search for compute_actions.

You will notice that it takes these optional arguments:

  • prev_action_batch (Union[List[TensorType], TensorType]) – Batch of previous action values.
  • prev_reward_batch (Union[List[TensorType], TensorType]) – Batch of previous rewards.

Since you are using those in your policy, you are going to have to provide them to the compute_actions call. Currently your sample loop does not do that. You will need to create an initial history for them (of size 16 in your case), probably all zeros, but that is up to you. Then every time you get a new action or reward, you will have to shift everything left and add the new value at the end.
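
As a rough, untested sketch of the rolling history (1-D rewards shown; the actions history works the same way along the time axis):

    # rough sketch (untested): 16-step rolling reward history, shifted left each step
    import numpy as np

    history_len = 16                          # matches num_frames
    prev_rewards = np.zeros(history_len, np.float32)

    reward = 1.0                              # placeholder for the reward from env.step
    prev_rewards[:-1] = prev_rewards[1:]      # shift everything left
    prev_rewards[-1] = reward                 # add the newest value at the end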

I got it @mannyv.
I had Ray 1.4 in my Docker image, but I have now upgraded.

I am using the following code:

    env = StatelessCartPole()

    # run until episode ends
    for _ in range(10):
        episode_reward = 0
        reward = 0.
        action = 0
        done = False
        obs = env.reset()
        state=np.zeros(2*256, np.float32).reshape(2,256)
        actions=np.zeros(2*16, np.float32).reshape(2,16)
        rewards=np.zeros(16, np.float32)
        # state=None
        while not done:
            action, state, logits = agent.compute_action(obs, state,prev_action_batch=actions, prev_reward_batch=rewards)
            obs, reward, done, info = env.step(action)
            actions[:,:-1] = actions[:,1:]
            actions[:, -1] = action
            rewards[:-1] = rewards[1:]
            rewards[-1] = reward
            episode_reward += reward

        print("reward: {}".format(episode_reward))

But I am getting the following error:

2021-07-02 18:59:31,137	INFO trainable.py:390 -- Current state after restoring: {'_iteration': 10, '_timesteps_total': None, '_time_total': 40.44977068901062, '_episodes_total': 2029}
/opt/conda/lib/python3.8/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
2021-07-02 18:59:31,139	WARNING deprecation.py:33 -- DeprecationWarning: `compute_action` has been deprecated. Use `compute_single_action` instead. This will raise an error in the future!
Traceback (most recent call last):
  File "trajectory_view_api.py", line 115, in <module>
    action, state, logits = agent.compute_action(obs, state,prev_action_batch=actions, prev_reward_batch=rewards)
  File "/home/condauser/.local/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1005, in compute_action
    return self.compute_single_action(*args, **kwargs)
TypeError: compute_single_action() got an unexpected keyword argument 'prev_action_batch'

Weird.

I think it is not properly installed. Let me install it again.

@mg64ve
There must be something we are missing when doing manual environment interactions with the trajectory view API.

I had to make two changes to get it to train for me:

  1. In forward, the actions were being created as ints, but it wanted them as floats, so change the last actions definition to: actions = torch.reshape(actions, [pa_batch_size,-1]).type_as(obs)

  2. Instead of this:

for _ in range(10):
        episode_reward = 0
        reward = 0.
        action = 0
        done = False
        obs = env.reset()
        state=np.zeros(2*256, np.float32).reshape(2,256)
        actions=np.zeros(2*16, np.float32).reshape(2,16)
        rewards=np.zeros(16, np.float32)
        # state=None
        while not done:
            action, state, logits = agent.compute_action(obs, state,prev_action_batch=actions, prev_reward_batch=rewards)
            obs, reward, done, info = env.step(action)
            actions[:,:-1] = actions[:,1:]
            actions[:, -1] = action
            rewards[:-1] = rewards[1:]
            rewards[-1] = reward
            episode_reward += reward

        print("reward: {}".format(episode_reward))

replace it with:
tune.run("PPO",config=config,stop=stop)

That ran correctly for me.

@mannyv I am already doing tune.run and it works.

What I want to do is the following: after tune.run, I want to use the trained model and also render the results.