Trajectory View API Example

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

The example's manual inference loop seems to set all of the prev_n_* values to the same value on every step. Shouldn't these instead be built from the rewards, actions, and observations produced by previous loop iterations?

For example, should that be something like:

import collections
import numpy as np

obs = env.reset()
prev_observation = collections.deque([obs] * num_frames, maxlen=num_frames)
prev_action = collections.deque([0] * num_frames, maxlen=num_frames)
prev_reward = collections.deque([1.0] * num_frames, maxlen=num_frames)
done = False
episode_reward = 0.0
while not done:
    action, state, logits = algo.compute_single_action(
        input_dict={
            "obs": obs,
            "prev_n_obs": np.stack(prev_observation),
            "prev_n_actions": np.stack(prev_action),
            "prev_n_rewards": np.stack(prev_reward),
        },
        full_fetch=True,  # to also get state and logits back
    )
    obs, reward, done, info = env.step(action)
    prev_observation.appendleft(obs)  # push newest step into the history
    prev_action.appendleft(action)
    prev_reward.appendleft(reward)
    episode_reward += reward

Anyone have any insights?

Hi @ahmedammar,

Your example code looks good to me except for one thing: you are adding the most recent time steps on the opposite side from how RLlib does it internally. RLlib places earlier steps to the left of more recent steps ([0, 1, 2, 3]); your use of appendleft produces the opposite order ([3, 2, 1, 0]).
This means that during manual inference your prev_x inputs will be in the reverse of the order used during training.
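The ordering difference is easy to see with a small standalone snippet (plain `collections.deque`, no RLlib needed):

```python
import collections

num_frames = 4

# append: oldest element stays on the left, newest goes on the right.
# This matches the [0, 1, 2, 3] ordering described above.
buf_right = collections.deque(range(num_frames), maxlen=num_frames)
buf_right.append(4)
print(list(buf_right))  # [1, 2, 3, 4]

# appendleft: newest element ends up on the left, so the buffer is
# reversed relative to that ordering -> [3, 2, 1, 0] style.
buf_left = collections.deque(range(num_frames), maxlen=num_frames)
buf_left.appendleft(4)
print(list(buf_left))  # [4, 0, 1, 2]
```

So the fix to the loop above is simply to use append instead of appendleft.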

Also, I think all of the prev_x buffers are initialized with zeros internally, not with the first observation or a reward of 1.0.
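If that is the case, seeding the buffers with zeros would match training more closely. A minimal sketch (the `obs_shape` value is a hypothetical observation shape, not something from the original example):

```python
import collections
import numpy as np

num_frames = 4
obs_shape = (4,)  # hypothetical; use your env's observation shape

# Seed the histories with zeros so the first few inference steps look
# like the zero-padded episode starts RLlib produces during training.
prev_observation = collections.deque(
    [np.zeros(obs_shape, dtype=np.float32)] * num_frames, maxlen=num_frames
)
prev_action = collections.deque([0] * num_frames, maxlen=num_frames)
prev_reward = collections.deque([0.0] * num_frames, maxlen=num_frames)

stacked = np.stack(prev_observation)
print(stacked.shape)  # (4, 4) -> all zeros until real steps are appended
```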


Thanks for the insight here. I wasn't sure whether I should be appending to the left or the right, so thanks for clarifying that.

But shouldn't the example code I linked be updated (happy to send a PR) so that the prev_x buffers actually hold previous values? Otherwise they are sent as the same static values on every step. Or am I missing something?

Hi @ahmedammar,

You’re welcome.

Yes, ideally it would be a working example.

The line below acknowledges that it is not complete.

Perhaps someone will generate one that is and submit a pull request.
