compute_single_action with explore=False returns the same result

Is it normal, with APPO and an attention net, for compute_single_action to return the same action for different observations when explore=False?

I have also tried using a model trained for only a single iteration to get more varied results, but consecutive calls to compute_single_action with different observations still return the same action:

action, state_out, _ = self.trainer.compute_single_action(
    obs, state=self.state_list, explore=False
)
# Roll the attention memory: append the newest state output and drop the
# oldest row, keeping a fixed-length window per transformer unit.
self.state_list = [
    np.concatenate((self.state_list[i], [state_out[i]]))[1:]
    for i in range(self.transformer_length)
]
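
For reference, RLlib's attention-net example rolls the memory the same way; the difference is often in how the initial state is built. A minimal sketch of that pattern, where the three model settings are assumptions that must match your training config, and algo and obs stand in for your trained APPO Algorithm and a single observation:

import numpy as np

# Assumed model settings (must match the values used during training):
num_transformer_units = 1   # model["attention_num_transformer_units"]
memory_inference = 50       # model["attention_memory_inference"]
attention_dim = 64          # model["attention_dim"]

# Initial attention memory: one (memory_inference, attention_dim) zero
# block per transformer unit.
state = [
    np.zeros((memory_inference, attention_dim), np.float32)
    for _ in range(num_transformer_units)
]

action, state_out, _ = algo.compute_single_action(obs, state=state, explore=False)

# Append the newest state output and drop the oldest timestep, so the
# memory window always stays memory_inference rows long.
state = [
    np.concatenate([state[i], [state_out[i]]])[1:]
    for i in range(num_transformer_units)
]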

Without explore=False it returns different actions, but I believe it is then just sampling randomly from the action distribution rather than acting on what the policy has learned.
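
To see the distinction, compare repeated calls on the same observation. A hypothetical check, reusing trainer, obs, and state from the snippet above:

# explore=True samples from the action distribution the policy outputs
# (a Gaussian for a continuous Box space), so two calls on the SAME
# observation will usually differ:
a1, _, _ = trainer.compute_single_action(obs, state=state, explore=True)
a2, _, _ = trainer.compute_single_action(obs, state=state, explore=True)

# explore=False returns the distribution's deterministic mode (the mean),
# so repeated calls on the same observation are identical by design.
# Identical actions across DIFFERENT observations, however, suggest the
# learned mean is flat or the inputs are not reaching the model.
d1, _, _ = trainer.compute_single_action(obs, state=state, explore=False)
d2, _, _ = trainer.compute_single_action(obs, state=state, explore=False)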

This is my observation space:

self.observation_space = gym.spaces.Dict({
    "data": gym.spaces.Box(low=-8.0, high=8.0, shape=(self.data_size,), dtype=np.float32),
    "h1": gym.spaces.Box(low=-2.1, high=2.1, shape=(15,), dtype=np.float32),
    "h2": gym.spaces.Box(low=-1.1, high=1.1, shape=(10,), dtype=np.float32),
})

And this is the action space:

self.action_space = gym.spaces.Box(
    low=0.0, high=1.0, shape=(2 * 3,), dtype=np.float32
)

Ray v2.20
Python 3.10.10
Windows 11

I would also like to get more info about this behavior. When serving an LSTM, deactivating exploration leads to the same action despite different observations. My assumption was that the model was trying to replay the trained policy step by step, but I don't think that's the case.
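
For what it's worth, the usual LSTM serving pattern is to seed the state from the policy once per episode and then feed each call's state output back into the next call. A minimal sketch under that assumption, where algo and env are placeholders for your trained Algorithm and gymnasium environment:

policy = algo.get_policy()
state = policy.get_initial_state()  # zeroed [h, c] at episode start

obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    # state_out from this step becomes the state input for the next step;
    # resetting it every call would make each observation look like step 0.
    action, state, _ = algo.compute_single_action(
        obs, state=state, explore=False
    )
    obs, reward, terminated, truncated, info = env.step(action)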


Did you find a solution? Have you tried it on Linux?