Trajectory View API Example

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

The example manual inference loop seems to set all of the prev_n_* values to the same constant value on every step. Shouldn't these instead be built from the observations, actions, and rewards returned by previous iterations of the loop?

For example, should that be something like:

...
import collections

import numpy as np

# env, algo (the trained Algorithm) and num_frames are assumed to be set up above.
obs = env.reset()
done = False
episode_reward = 0.0

# Seed the frame-stacking buffers the same way the example seeds its constant
# inputs: the initial observation, action 0 and reward 1.0.
prev_observation = collections.deque([obs] * num_frames, maxlen=num_frames)
prev_action = collections.deque([0] * num_frames, maxlen=num_frames)
prev_reward = collections.deque([1.0] * num_frames, maxlen=num_frames)

while not done:
    action, state, logits = algo.compute_single_action(
        input_dict={
            "obs": obs,
            "prev_n_obs": np.stack(prev_observation),
            "prev_n_actions": np.stack(prev_action),
            "prev_n_rewards": np.stack(prev_reward),
        },
        full_fetch=True,
    )
    obs, reward, done, info = env.step(action)
    # Append on the right so each stack stays ordered oldest -> newest,
    # matching how the view requirements collect past timesteps during training.
    prev_observation.append(obs)
    prev_action.append(action)
    prev_reward.append(reward)
    episode_reward += reward
...
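
For context, the prev_n_* keys above come from the view requirements declared by the frame-stacking model in the trajectory view example. A minimal sketch of those declarations, assuming the custom model from RLlib's trajectory_view_api example (exact shift ranges may differ between versions):

from ray.rllib.policy.view_requirement import ViewRequirement

# Inside the custom model's __init__ (sketch, not the exact example code):
# "prev_n_obs" covers the current obs plus the (num_frames - 1) obs before it,
# while actions and rewards cover the num_frames steps before the current one.
self.view_requirements["prev_n_obs"] = ViewRequirement(
    data_col="obs", shift="-{}:0".format(num_frames - 1), space=obs_space)
self.view_requirements["prev_n_actions"] = ViewRequirement(
    data_col="actions", shift="-{}:-1".format(num_frames), space=self.action_space)
self.view_requirements["prev_n_rewards"] = ViewRequirement(
    data_col="rewards", shift="-{}:-1".format(num_frames))

If that is indeed how the shifts are defined, stacking oldest-to-newest in the manual loop should line up with the frame ordering the model saw during training.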