How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
The example manual inference loop seems to set all of the prev_n_* values to the same value on every iteration. Shouldn't these instead be based on the observations, actions, and rewards from previous loop iterations?
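For reference, here is roughly what I understand the current example's inner loop to be doing (this is my paraphrase with the setup omitted, so variable names may not match the example exactly): the same current obs and the same dummy action/reward are re-stacked num_frames times on every call:
...
while not done:
    # The same current obs and fixed dummy action/reward are stacked
    # num_frames times on every iteration, regardless of earlier steps.
    action, state, logits = algo.compute_single_action(
        input_dict={
            "obs": obs,
            "prev_n_obs": np.stack([obs for _ in range(num_frames)]),
            "prev_n_actions": np.stack([0 for _ in range(num_frames)]),
            "prev_n_rewards": np.stack([1.0 for _ in range(num_frames)]),
        },
        full_fetch=True,
    )
    obs, reward, done, info = env.step(action)
    episode_reward += reward
...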
Should it instead be something like:
...
import collections

import numpy as np

obs = env.reset()
# Seed the frame-stack buffers with the same dummy values the example uses
# (0 for actions, 1.0 for rewards) until real history is available.
prev_observation = collections.deque([obs] * num_frames, maxlen=num_frames)
prev_action = collections.deque([0] * num_frames, maxlen=num_frames)
prev_reward = collections.deque([1.0] * num_frames, maxlen=num_frames)

done = False
episode_reward = 0.0
while not done:
    action, state, logits = algo.compute_single_action(
        input_dict={
            "obs": obs,
            "prev_n_obs": np.stack(prev_observation),
            "prev_n_actions": np.stack(prev_action),
            "prev_n_rewards": np.stack(prev_reward),
        },
        full_fetch=True,
    )
    obs, reward, done, info = env.step(action)
    # Append on the right so the oldest entry stays at index 0; thanks to
    # maxlen, the deques always hold exactly the last num_frames entries.
    prev_observation.append(obs)
    prev_action.append(action)
    prev_reward.append(reward)
    episode_reward += reward
...
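One detail I am not sure about is the stacking order: with append() the oldest entry ends up at index 0 of the stacked arrays, which I am assuming matches the oldest-to-newest ordering that the "-num_frames:-1" shift produces during training. If the model instead expects newest-first, appendleft() would be needed.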