Is it normal with APPO/attention net for compute_single_action to return the same result with explore=False and different observations?
I also tried using a model trained for only a single iteration, hoping for more varied outputs, but consecutive calls to compute_single_action with different observations still return the same result:
action, state_out, _ = self.trainer.compute_single_action(obs, state=self.state_list, explore=False)
self.state_list = [np.concatenate((self.state_list[i], [state_out[i]]))[1:] for i in range(self.transformer_length)]
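For context, here is a self-contained sketch of the state bookkeeping I'm doing. The RLlib call is mocked out with a stub, and the values for the number of transformer units, memory length, and attention dim are placeholders, not my actual config:

```python
import numpy as np

# Placeholder hyperparameters (assumptions, not my real config values)
NUM_TRANSFORMER_UNITS = 1  # attention_num_transformer_units
MEMORY_INFERENCE = 10      # attention_memory_inference
ATTENTION_DIM = 64         # attention_dim

def mock_compute_single_action(obs, state):
    """Stub standing in for trainer.compute_single_action: returns a
    6-dim action and one new memory row per transformer unit."""
    action = np.zeros(6, dtype=np.float32)
    state_out = [np.random.randn(ATTENTION_DIM).astype(np.float32)
                 for _ in range(NUM_TRANSFORMER_UNITS)]
    return action, state_out, {}

# Initial state: one zero memory block per transformer unit
state_list = [np.zeros((MEMORY_INFERENCE, ATTENTION_DIM), dtype=np.float32)
              for _ in range(NUM_TRANSFORMER_UNITS)]

for step in range(3):
    obs = {"data": np.zeros(4, dtype=np.float32)}  # dummy observation
    action, state_out, _ = mock_compute_single_action(obs, state_list)
    # Slide the memory window: append the newest row, drop the oldest
    state_list = [np.concatenate((state_list[i], [state_out[i]]))[1:]
                  for i in range(NUM_TRANSFORMER_UNITS)]
```

Each state entry keeps the fixed shape (MEMORY_INFERENCE, ATTENTION_DIM) after every step, so the sliding-window update itself looks correct to me; the question is why the action doesn't change with the observation.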
Without explore=False it returns different actions, but I suspect it is just sampling random actions rather than acting on what it has learned.
This is my observation space:
self.observation_space = gym.spaces.Dict({
    "data": gym.spaces.Box(low=-8.0, high=8.0, shape=(self.data_size,), dtype=np.float32),
    "h1": gym.spaces.Box(low=-2.1, high=2.1, shape=(15,), dtype=np.float32),
    "h2": gym.spaces.Box(low=-1.1, high=1.1, shape=(10,), dtype=np.float32),
})
And this is the action space:
self.action_space = gym.spaces.Box(
    low=0.0, high=1.0, shape=(2 * 3,), dtype=np.float32)
Environment: Ray 2.20
Python 3.10.10
Windows 11