Trajectory Tracking with RLlib

Hello, I am working on a project and I have implemented the following Python notebook in this GitHub repo:

The aim of this environment is to follow a trajectory with discrete actions (0,1,2).
The following is a short description:

In short, the agent should use discrete actions to follow the curve.
An agent that already knows the future steps of the trajectory can take the right decisions and maximize the reward.
I have written two of these data-leak agents; the following are the results:

[image: results of the two data-leak agents]

As you can see, the first is much better than the second, and it corresponds to using action 1 (Stay) whenever necessary.

So it turns out that the agent that uses only the Up and Down actions gets worse performance.
Then I used PPO with an LSTM to try to find the optimal solution.
The result is that the RLlib agent only achieves a partial result, since it uses only the Up and Down actions.
It seems it does not find the best solution.
It sounds like it does not explore the environment very well.
Is there any parameter that can help me improve this?
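
For context, the environment interface is roughly the following (a simplified sketch, not the actual notebook code; the constructor signature and observation bounds here are just placeholders):

import gym
import numpy as np
from gym import spaces

class TrajectoryTrackingEnv(gym.Env):
    # Discrete actions: 0 = Down, 1 = Stay, 2 = Up
    def __init__(self, data):
        self.data = data
        self.action_space = spaces.Discrete(3)
        # three trajectory-derived features per step
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)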

Hi,
That is a lot of code for an agent that is only supposed to follow a signal!
Also, a couple of things here confuse me, such as this equation:
[screenshot: equation involving the "agent value"]
I suppose the “agent value” is the value function? Could you explain?
Also, in your code I find the following lines:

if action == FollowingActions.Up.value:
    # move the agent value up by step_value times the distance to the upper bound
    self.agent_value += self.step_value * np.abs(self.agent_value - self.up_bound)
elif action == FollowingActions.Down.value:
    # move the agent value down by step_value times the distance to the lower bound
    self.agent_value -= self.step_value * np.abs(self.agent_value - self.lower_bound)
    # flip the sign of the raw reward
    self.raw_reward = -self.raw_reward
elif action == FollowingActions.Stay.value:
    # no raw reward when staying
    self.raw_reward = 0.0

Even though I have trouble following exactly what you are doing, the results look good, don't they?
The red dash-dotted line represents the actions that your agent chose after training properly, right?
If you are not satisfied with them, I think you can try and reduce what you call the “cost” in your environment to get your agent to be closer to your sample signal.

As far as I can see, the graphs that you posted do not show anything about exploration.

Thanks @arturn for your reply.
The agent value is a helper value and it is used to calculate the reward. The reward is higher when the agent value is close to the trajectory value.
That being said, what I expect from RLlib is that it finds the actions that maximize the reward.
In my opinion the agent should take actions similar to the following (the agent value is the dash-dotted line, the blue line is the trajectory, and the dashed lines are the boundaries):

That means the agent should learn the trajectory from the observations and, just before the trajectory changes direction, switch from action 2 (Up) to action 1 (Stay), or likewise from action 0 (Down) to action 1 (Stay). This is the behavior that maximizes the reward.
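To make this rule concrete, a data-leak agent along these lines could be sketched as follows (illustrative only; the variable names and the one-step lookahead are assumptions, not the exact code from the repo):

DOWN, STAY, UP = 0, 1, 2

def oracle_action(trajectory, t):
    # Uses future values of the trajectory (data leak) to pick the action,
    # ignoring boundary handling at the end of the trajectory.
    current_slope = trajectory[t + 1] - trajectory[t]
    next_slope = trajectory[t + 2] - trajectory[t + 1]
    if current_slope * next_slope < 0:
        # The trajectory is about to change direction: switch to Stay.
        return STAY
    return UP if current_slope > 0 else DOWN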
Instead, what I am getting from RLlib is:

That means the agent always switches between action 0 (Down) and action 2 (Up) without ever choosing action 1 (Stay). Doing this, the agent loses a lot of reward.
I have tried different values for self.fee (1.0, 0.5, 0.1), but the result is similar, and with self.fee = 0.1 it is even worse:

The agent changes its action too often, but only between actions 0 and 2:

It is not easy to explain all the details. If you have any further questions, please let me know.
Thanks.

@mg64ve you might want to try increasing max_seq_len. Which version of RLlib are you using? There is a bug with rnn_sequencing in the current release.
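
For reference, max_seq_len is part of the model config, e.g. (the values here are just placeholders):

config = {
    "model": {
        "use_lstm": True,
        "lstm_cell_size": 256,
        "max_seq_len": 50,  # try raising this
    },
}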

Hi @mg64ve,

Your observations are defined by the following function:

def get_samples(self):
    # Observation: relative change of the current value w.r.t. the previous three values.
    c = self.data[self.position]
    c1 = self.data[self.position - 1]
    c2 = self.data[self.position - 2]
    c3 = self.data[self.position - 3]
    return np.array(((c - c1) / c1, (c - c2) / c2, (c - c3) / c3))

To me it looks like these observations only include information about the blue curve, but not about the red one. Am I right?
So all the agent can do is look out for changes in the curvature of the blue curve, and maybe these changes have to be somewhat large for the agent to "tip".
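
If so, one hypothetical way to expose the red curve to the agent would be to append its relative position to the observation, roughly like this (a sketch only, not your code; the observation space would then need four entries):

def get_samples(self):
    c = self.data[self.position]
    c1 = self.data[self.position - 1]
    c2 = self.data[self.position - 2]
    c3 = self.data[self.position - 3]
    # relative position of the agent value w.r.t. the current trajectory value
    rel_agent = (self.agent_value - c) / c
    return np.array(((c - c1) / c1, (c - c2) / c2, (c - c3) / c3, rel_agent))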

Where did you get the "good" graph from, where the agent performs to your expectations by choosing action 1 from time to time? Is it not the outcome of an RLlib experiment?

Cheers

@mannyv I am using version 1.8.0. With max_seq_len=50 the situation does not improve; with max_seq_len=200 I get the following error:


2021-11-11 22:40:32,046	ERROR trial_runner.py:924 -- Trial PPO_TrajectoryTrackingEnv_f8d8b_00000: Error processing event.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/opt/conda/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/opt/conda/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::PPO.train() (pid=4077, ip=172.18.0.6, repr=PPO)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 45, in ppo_surrogate_loss
    logits, state = model(train_batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 243, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/torch/recurrent_net.py", line 187, in forward
    wrapped_out, _ = self._wrapped_forward(input_dict, [], None)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/torch/fcnet.py", line 123, in forward
    self._last_flat_in = obs.reshape(obs.shape[0], -1)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

The above exception was the direct cause of the following exception:

ray::PPO.train() (pid=4077, ip=172.18.0.6, repr=PPO)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 682, in train
    raise e
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 668, in train
    result = Trainable.train(self)
  File "/opt/conda/lib/python3.9/site-packages/ray/tune/trainable.py", line 283, in train
    result = self.step()
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/trainer_template.py", line 206, in step
    step_results = next(self.train_exec_impl)
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/iter.py", line 791, in apply_foreach
    result = fn(item)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/execution/train_ops.py", line 197, in __call__
    results = policy.learn_on_loaded_batch(
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 607, in learn_on_loaded_batch
    return self.learn_on_batch(batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 507, in learn_on_batch
    grads, fetches = self.compute_gradients(postprocessed_batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/policy_template.py", line 336, in compute_gradients
    return parent_cls.compute_gradients(self, batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 678, in compute_gradients
    tower_outputs = self._multi_gpu_parallel_grad_calc(
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/torch_policy.py", line 1052, in _multi_gpu_parallel_grad_calc
    raise last_result[0] from last_result[1]
ValueError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
In tower 0 on device cpu

Exactly, @arturn!
In the following example:

I am considering two agents that know the future values of the trajectory:

The first gets much better rewards than the second. They both know when the trajectory is going to change. The second uses only two actions [0, 2], but the first uses all three actions [0, 1, 2].
The following is an example of results:

[image: example of the results of the two agents]

@mg64ve,

Did you see the issue I linked? I think you are experiencing it.

Thanks @mannyv. I had a quick look, but I need more time to dive deeper into it. Is this only for 1.7.0 or for all versions?

@mg64ve

You can avoid the issue even in the current version by doing two things:

  1. Set the config["simple_optimizer"] = True
  2. Set sgd_minibatch_size > max_seq_len
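
In config terms, roughly (the values are placeholders):

config = {
    "simple_optimizer": True,
    "sgd_minibatch_size": 256,  # keep this larger than model["max_seq_len"]
    "model": {
        "use_lstm": True,
        "max_seq_len": 50,
    },
}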

Hi @mannyv, I have tested both 1.8.0 and 2.0.0.dev0 with the following configuration:

and the hint you gave me does not solve the problem. I get a similar behaviour:

It looks like it now uses action 1 (Stay), but not really when it would be better to use it.
I am using 100 epochs; do you think training longer would be better?

I just trained for 250 epochs and it seems better:


@mg64ve,

That is looking a lot better. Still some room for improvement though. How do the rewards compare to your hardcoded strategies with future knowledge?

Not sure yet, @mannyv. I am trying to analyze the results.

I think the results are not so bad. One more question, @mannyv, regarding inference: is this part correct?

Or should it be more similar to this?
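
For context, by "inference" I mean stepping the trained policy while carrying the LSTM state, roughly like this (a generic sketch assuming a trained `trainer` object and an `env` instance, not the exact notebook cells referenced above):

# Generic sketch of stateful (LSTM) inference with RLlib; on older RLlib
# versions compute_single_action is called compute_action.
state = trainer.get_policy().get_initial_state()
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # The recurrent state returned by the policy must be fed back in at the next step.
    action, state, _ = trainer.compute_single_action(obs, state=state, explore=False)
    obs, reward, done, info = env.step(action)
    total_reward += reward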