Compute_actions for Trajectory API

Hello, consider the following documentation:

https://docs.ray.io/en/master/rllib-training.html#computing-actions

It does not mention that this does not apply to models using the Trajectory View API.
Now if you consider the following example:

this is training only.
Suppose I want a customized replay in which I also render my environment at the end. I could add something like the following:

    checkpoints = results.get_trial_checkpoints_paths(
        trial=results.get_best_trial('episode_reward_mean', mode='max'),
        metric='episode_reward_mean')

    checkpoint_path = checkpoints[0][0]
    agent = agents.ppo.PPOTrainer(config, env="stateless_cartpole")
    agent.restore(checkpoint_path)

    env = StatelessCartPole()

    # run until episode ends
    for _ in range(10):
        episode_reward = 0
        reward = 0.
        action = 0
        done = False
        obs = env.reset()
        state=np.zeros(2*256, np.float32).reshape(2,256)
        # state=None
        while not done:
            action, state, logits = agent.compute_action(obs, state)
            obs, reward, done, info = env.step(action)
            episode_reward += reward

        print("reward: {}".format(episode_reward))
        env.render()

Now this code fails with the following error:

2021-07-05 17:23:10,845	WARNING deprecation.py:33 -- DeprecationWarning: `compute_action` has been deprecated. Use `compute_single_action` instead. This will raise an error in the future!
Traceback (most recent call last):
  File "trajectory_view_api.py", line 115, in <module>
    action, state, logits = agent.compute_action(obs, state)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1005, in compute_action
    return self.compute_single_action(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 986, in compute_single_action
    result = self.get_policy(policy_id).compute_single_action(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 224, in compute_single_action
    out = self.compute_actions(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 239, in compute_actions
    return self._compute_action_helper(input_dict, state_batches,
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 326, in _compute_action_helper
    dist_inputs, state_out = self.model(input_dict, state_batches,
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/models/modelv2.py", line 230, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/examples/models/trajectory_view_utilizing_models.py", line 119, in forward
    obs = input_dict["prev_n_obs"]
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 500, in __getitem__
    value = dict.__getitem__(self, key)
KeyError: 'prev_n_obs'

and adding this:

    checkpoint_path = checkpoints[0][0]
    agent = agents.ppo.PPOTrainer(config, env="stateless_cartpole")
    agent.restore(checkpoint_path)

    env = StatelessCartPole()

    policy = agent.get_policy()
    # run until episode ends
    for _ in range(10):
        episode_reward = 0
        reward = 0.
        action = 0
        done = False
        obs = env.reset()
        state=np.zeros(2*256, np.float32).reshape(2,256)
        actions=np.zeros(16, np.float32).reshape(1,16)
        rewards=np.zeros(16, np.float32).reshape(1,16)

        # state=None
        while not done:
            action, state, logits = policy.compute_actions(obs, state, prev_action_batch=actions, prev_reward_batch=rewards)
            obs, reward, done, info = env.step(action)
            episode_reward += reward

        print("reward: {}".format(episode_reward))
        env.render()

fails with the following error:

2021-07-05 17:27:56,257	INFO trainable.py:390 -- Current state after restoring: {'_iteration': 10, '_timesteps_total': None, '_time_total': 30.673815488815308, '_episodes_total': 1226}
/opt/conda/lib/python3.8/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Traceback (most recent call last):
  File "trajectory_view_api.py", line 135, in <module>
    action, state, logits = policy.compute_actions(obs, state, prev_action_batch=actions, prev_reward_batch=rewards)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 237, in compute_actions
    for s in (state_batches or [])
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Exception ignored in: <function ActorHandle.__del__ at 0x7fcb8a0d3790>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ray/actor.py", line 834, in __del__

Is there any further documentation on how to manually replay a trained policy after tune.run when using the Trajectory View API?
I can’t get it to work.
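
A note on the second traceback above: the `ValueError` appears to come from the expression `state_batches or []`, which asks for the truth value of whatever was passed as `state`. A multi-element NumPy array has no single truth value, whereas a list of arrays does. The minimal sketch below reproduces only that Python/NumPy behaviour. `truthiness_error` and `split_state` are hypothetical helper names, and whether a list of two `(1, 256)` arrays is the right state layout for this model is an assumption, not something confirmed in this thread.

```python
import numpy as np

# Same shape as the `state` array used in the snippets above.
state = np.zeros((2, 256), dtype=np.float32)


def truthiness_error(value):
    """Return the message raised by `value or []`, or None if it evaluates fine."""
    try:
        value or []
    except ValueError as exc:
        return str(exc)
    return None


def split_state(array):
    """Hypothetical helper: one (1, hidden) batch per state tensor, as a list."""
    return [row[np.newaxis, :] for row in array]


state_batches = split_state(state)

print(truthiness_error(state))           # the "ambiguous truth value" message
print(truthiness_error(state_batches))   # None: a non-empty list is simply truthy
print([s.shape for s in state_batches])
```

This only addresses the `ValueError`; it does not supply the `prev_n_obs` input that the first traceback complains about.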

@mg64ve did you find a way to do this? If not, I have also opened a new issue on the GitHub repository: [Bug] Compute_action for Trajectory view API · Issue #18777 · ray-project/ray (github.com)

@sven1977 could you please help and tell us how to implement this?

Hi @kapilPython, no, I haven’t figured out how to fix it. I would also appreciate it if anyone could help us. Please let me know if you get a reply on GitHub. Thanks.

@ericl do you have any idea how to do this?

As a hacky solution, people could use:

    agent.get_policy().compute_actions_from_input_dict({
        "obs": np.array([obs]),
        "prev_n_obs": np.expand_dims(np.stack([obs for _ in range(16)]), axis=0),
        "prev_n_actions": np.expand_dims(np.stack([0 for _ in range(16)]), axis=0),
        "prev_n_rewards": np.expand_dims(np.stack([1.0 for _ in range(16)]), axis=0),
    })

The complete solution to the problem will be provided by @sven1977 through the issue:
[Bug] Compute_action for Trajectory view API · Issue #18777 · ray-project/ray (github.com)
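
The workaround above fills the history with copies of the current observation and constant actions and rewards on every call. A slightly less hacky variant might keep a rolling window of what actually happened during the episode. The sketch below is only an illustration in plain NumPy: `HistoryBuffer` is a hypothetical helper, the window length of 16 is taken from the workaround, and the zero padding, the oldest-to-newest ordering, and which observation counts as the newest entry are assumptions that should be checked against the model's view requirements.

```python
from collections import deque

import numpy as np


class HistoryBuffer:
    """Hypothetical helper keeping the last `n` observations, actions and rewards."""

    def __init__(self, first_obs, n=16):
        # Before any step has been taken, pad with copies of the first
        # observation and with zero actions and rewards (an assumption).
        self.obs = deque([first_obs] * n, maxlen=n)
        self.actions = deque([0] * n, maxlen=n)
        self.rewards = deque([0.0] * n, maxlen=n)

    def add(self, obs, action, reward):
        # A deque with maxlen drops its oldest entry on append.
        self.obs.append(obs)
        self.actions.append(action)
        self.rewards.append(reward)

    def input_dict(self, current_obs):
        # Add a leading batch dimension of 1, as in the workaround above.
        return {
            "obs": np.array([current_obs]),
            "prev_n_obs": np.expand_dims(np.stack(list(self.obs)), axis=0),
            "prev_n_actions": np.expand_dims(np.array(self.actions), axis=0),
            "prev_n_rewards": np.expand_dims(
                np.array(self.rewards, dtype=np.float32), axis=0),
        }


first_obs = np.zeros(2, dtype=np.float32)  # placeholder observation
history = HistoryBuffer(first_obs, n=16)
history.add(np.ones(2, dtype=np.float32), action=1, reward=1.0)
batch = history.input_dict(current_obs=np.ones(2, dtype=np.float32))
print({key: value.shape for key, value in batch.items()})
```

In an inference loop, `batch` would take the place of the hand-built dict, with `history.add(...)` called after every `env.step(...)`.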

Hey @kapilPython @mg64ve, here is the PR that provides a solution. It adds a small inference loop at the end of the trajectory_view_api.py example, showing how you can now use the Trainer.compute_single_action method with an input dict (that contains prev_n_obs, etc.):

Hey @sven1977 , thanks for your help!
I see this PR was merged into the master branch yesterday.
So it should already be in the nightly build, right?
I have tried rebuilding my Docker image with the latest 2.0.0 nightly build, but it is still not working.

Hi @mg64ve,

Just double checking. Did you update the call like in the example?
Before you had:

action, state, logits = agent.compute_action(obs, state)

now it is:

This code works. Thank you @mannyv

Another question regarding trajectory_view_api.py, @mannyv @sven1977 @kapilPython: will the following code still work if I switch the commented-out sections and use the LSTM or attention part instead?
In my environment, training works but manual inference fails.

I’ve tried this solution with an agent trained with the following config:

    {
        "env": "SimpleCryptoEnv",  # "CartPole-v0", #
        # "env_config": config_train,  # The dictionary we built before
        "log_level": "WARNING",
        "framework": "torch",
        "_fake_gpus": False,
        "callbacks": MyCallback,
        "ignore_worker_failures": True,
        "num_workers": 12,  # One worker per agent. You can increase this but it will run fewer parallel trainings.
        "num_envs_per_worker": 1,
        "num_gpus": 1,  # I yet have to understand if using a GPU is worth it, for our purposes, but I think it's not. This way you can train on a non-gpu enabled system.
        "clip_rewards": True,
        # "lr": 1e-4,  # Hyperparameter grid search defined above
        # "gamma": 0.99,  # This can have a big impact on the result and needs to be properly tuned (range is 0 to 1)
        # "lambda": 1.0,
        "observation_filter": "MeanStdFilter",
        "model": {
            "fcnet_hiddens": [256, 256],  # Hyperparameter grid search defined above
            "use_attention": True,
            "attention_use_n_prev_actions": 64,
            "attention_use_n_prev_rewards": 64,
            "vf_share_layers": True,
        },
        #"num_sgd_iter": 10,  # tune.choice([10, 20, 30]),
        "sgd_minibatch_size": 1024, # 128  # tune.choice([128, 512, 2048]),
        "train_batch_size": 32768, # , # 1024 # tune.choice([10000, 20000, 40000]),
        "evaluation_interval": 1,  # Run evaluation on every iteration
        "vf_clip_param": 300000,
        "evaluation_config": {
            "env_config": config_eval,  # The dictionary we built before (only the overriding keys to use in evaluation)
            "explore": False,  # We don't want to explore during evaluation. All actions have to be repeatable.
        },
    }

and it crashes with this error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_348/2132439779.py in <module>
----> 1 agent.compute_single_action(
      2                 input_dict={
      3                     "obs": obs,
      4                     "prev_n_obs": np.stack([obs for _ in range(num_frames)]),
      5                     "prev_n_actions": np.stack([0 for _ in range(num_frames)]),

~\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py in compute_single_action(self, observation, state, prev_action, prev_reward, info, input_dict, policy_id, full_fetch, explore, timestep, episode, unsquash_action, clip_action, unsquash_actions, clip_actions, **kwargs)
   1483         if input_dict is not None:
   1484             input_dict[SampleBatch.OBS] = observation
-> 1485             action, state, extra = policy.compute_single_action(
   1486                 input_dict=input_dict,
   1487                 explore=explore,

~\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\policy.py in compute_single_action(self, obs, state, prev_action, prev_reward, info, input_dict, episode, explore, timestep, **kwargs)
    216             episodes = [episode]
    217 
--> 218         out = self.compute_actions_from_input_dict(
    219             input_dict=SampleBatch(input_dict),
    220             episodes=episodes,

~\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py in compute_actions_from_input_dict(self, input_dict, explore, timestep, **kwargs)
    292                 if state_batches else None
    293 
--> 294             return self._compute_action_helper(input_dict, state_batches,
    295                                                seq_lens, explore, timestep)
    296 

~\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\utils\threading.py in wrapper(self, *a, **k)
     19         try:
     20             with self._lock:
---> 21                 return func(self, *a, **k)
     22         except AttributeError as e:
     23             if "has no attribute '_lock'" in e.args[0]:

~\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py in _compute_action_helper(self, input_dict, state_batches, seq_lens, explore, timestep)
    932             else:
    933                 dist_class = self.dist_class
--> 934                 dist_inputs, state_out = self.model(input_dict, state_batches,
    935                                                     seq_lens)
    936 

~\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\modelv2.py in __call__(self, input_dict, state, seq_lens)
    241 
    242         with self.context():
--> 243             res = self.forward(restored, state or [], seq_lens)
    244 
    245         if isinstance(input_dict, SampleBatch):

~\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\attention_net.py in forward(self, input_dict, state, seq_lens)
    345                 state: List[TensorType],
    346                 seq_lens: TensorType) -> (TensorType, List[TensorType]):
--> 347         assert seq_lens is not None
    348         # Push obs through "unwrapped" net's `forward()` first.
    349         wrapped_out, _ = self._wrapped_forward(input_dict, [], None)

AssertionError: 

Even when using the hack:

    agent.compute_single_action(
        input_dict={
            "obs": obs,
            "prev_n_obs": np.stack([obs for _ in range(num_frames)]),
            "prev_n_actions": np.stack([0 for _ in range(num_frames)]),
            "prev_n_rewards": np.stack([1.0 for _ in range(num_frames)]),
        },
        full_fetch=True,
    )

I really don’t understand why this doesn’t crash during training but crashes when manually evaluating the agent…
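
One possible reading of the last traceback: the frame in `compute_actions_from_input_dict` ends with `if state_batches else None`, which suggests that `seq_lens` is only set when the input dict carries recurrent state, and the attention net then asserts `seq_lens is not None`. The sketch below imitates that logic in plain NumPy. It is an assumption reconstructed from the traceback, not RLlib's actual source; `derive_seq_lens`, the `state_in_` key prefix, and all shapes are illustrative.

```python
import numpy as np


def derive_seq_lens(input_dict):
    """Illustrative stand-in: seq_lens exists only if state inputs are present."""
    state_batches = [
        value for key, value in input_dict.items() if key.startswith("state_in_")
    ]
    if not state_batches:
        return None
    # One time step per batch row, as in single-step inference.
    return np.ones(len(input_dict["obs"]), dtype=np.int32)


obs = np.zeros((1, 4), dtype=np.float32)  # batch of one placeholder observation

without_state = {"obs": obs}
with_state = {"obs": obs, "state_in_0": np.zeros((1, 8, 16), dtype=np.float32)}

print(derive_seq_lens(without_state))
print(derive_seq_lens(with_state))
```

If this reading is right, training works because the sampler supplies state and sequence lengths itself, whereas a hand-built input dict without state entries leaves `seq_lens` as `None`. Including the model's state under the appropriate keys might avoid the assertion, but that is untested here.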