TrajectoryTracking with RLlib

Hello, I am working on a project and have implemented the following Python notebook in this GitHub repo:

The aim of this environment is to follow a trajectory with discrete actions (0,1,2).
The following is a short description:

In short, the agent should use discrete actions to follow the curve.
An agent that already knows the future steps of the trajectory can take the right decisions and maximize the reward.
I have written two of these data-leak agents; the following are the results:


As you can see, the first is much better than the second, and it corresponds to using action 1 (Stay) whenever necessary:

So it turns out that the agent that uses only the Up and Down actions performs worse.
Then I used PPO with an LSTM to try to find the optimal solution.
The result is that the RLlib agent achieves only a partial result, since it uses only the Up and Down actions.
It seems that it does not find the best solution.
It looks like it does not explore the environment very well.
Is there any parameter that can help me improve this?

That is a lot of code for an agent that is only supposed to follow a signal!
Also, a couple of things here confuse me, such as this equation:
[screenshot of the equation]
I suppose the “agent value” is the value function? Could you explain?
Also, in your code I find the following lines:

if action == FollowingActions.Up.value:
  self.agent_value += self.step_value * np.abs(self.agent_value - self.up_bound)
elif action == FollowingActions.Down.value:
  self.agent_value -= self.step_value * np.abs(self.agent_value - self.lower_bound)
  self.raw_reward = -self.raw_reward
elif action == FollowingActions.Stay.value:
  self.raw_reward = .0

Even though I have trouble following what you are doing, the results look good, don’t they?
The red dash-dotted line represents the actions that your agent chose after proper training, right?
If you are not satisfied with them, I think you can try reducing what you call the “cost” in your environment to get your agent closer to your sample signal.

In the graphs that you posted there is no such thing as exploring as far as I can see.

Thanks @arturn for your reply.
The agent value is a helper value used to calculate the reward. The reward is higher when the agent value is close to the trajectory value.
That being said, what I expect from RLlib is that it finds the actions that maximize the reward.
In my opinion the agent should take actions similar to the following (the agent value is the dash-dotted line, the trajectory is blue, and the dashed lines are the boundaries):

That means the agent should learn the trajectory from the observations and, just before the trajectory changes direction, switch from action 2 (up) to action 1 (stay) or, vice versa, from action 0 (down) to action 1 (stay). This is the behavior that maximizes the reward.
Instead, what I am getting from RLlib is:
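
A minimal sketch of the kind of look-ahead ("data leak") policy described above, assuming the trajectory is available as a NumPy array and the action encoding is 0 = down, 1 = stay, 2 = up as in this thread. The function name `lookahead_actions` and the `horizon` parameter are hypothetical; this is not the code from the notebook.

```python
import numpy as np

DOWN, STAY, UP = 0, 1, 2  # action encoding used in this thread


def lookahead_actions(trajectory, horizon=1):
    """Choose actions from future trajectory values (an oracle, not a learned policy)."""
    trajectory = np.asarray(trajectory, dtype=float)
    # Sign of the slope at each step: +1 rising, -1 falling, 0 flat.
    slope = np.sign(np.diff(trajectory))
    actions = np.full(len(slope), STAY)
    actions[slope > 0] = UP
    actions[slope < 0] = DOWN
    # Indices where the slope sign differs from the previous step.
    turns = np.flatnonzero(slope[1:] != slope[:-1]) + 1
    # Assumption: staying for `horizon` steps before each turn is what pays off.
    for t in turns:
        actions[max(0, t - horizon):t] = STAY
    return actions


print(lookahead_actions([0, 1, 2, 3, 2, 1]))  # [2 2 1 0 0]
```

Comparing the episode reward of this sketch with `horizon=0` (only up and down) and `horizon=1` would reproduce the two hand-written agents in spirit, under the assumptions above.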

That means the agent always switches from action 0 (down) to action 2 (up) and vice versa, without ever choosing action 1 (stay). By doing this, the agent loses a lot of reward.
I have tried different values of self.fee (1.0, 0.5, 0.1), but the result is similar, and with self.fee = 0.1 it is even worse:

The agent changes its action too often, and only between actions 0 and 2:

It is not easy to explain all the details. If you have any further questions, please let me know.

@mg64ve you might want to try increasing max_seq_len. Which version of RLlib are you using? There is a bug with rnn_sequencing in the current release.

Hi @mg64ve,

Your observations are defined by the following function:

def get_samples(self):
        c =[self.position]
        c1 =[self.position-1]
        c2 =[self.position-2]
        c3 =[self.position-3]
        return np.array(((c-c1)/c1,(c-c2)/c2,(c-c3)/c3))

To me it looks like these observations only include information on the blue curve, but not on the red one. Am I right?
So all the agent can do is look out for changes in the curvature of the blue curve, and maybe these changes have to be somewhat large for the agent to “tip”.

Where did you get the “good” graph from, the one where the agent performs to your expectations by choosing action 1 from time to time? Is it not the outcome of an RLlib experiment?
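
To make that reading concrete, here is a hedged sketch of what `get_samples` appears to compute, assuming the trajectory is stored in a 1-D array and `position` indexes into it. The name `relative_changes` and the `series` argument are hypothetical; the notebook's actual data access is not shown in this thread.

```python
import numpy as np


def relative_changes(series, position, lags=(1, 2, 3)):
    """Relative change of the current trajectory value versus earlier values."""
    series = np.asarray(series, dtype=float)
    if position < max(lags):
        raise ValueError("position must be at least max(lags)")
    current = series[position]
    # Values 1, 2 and 3 steps back (fancy indexing with a list of indices).
    past = series[[position - lag for lag in lags]]
    # Only trajectory values enter here; the agent's own value does not.
    return (current - past) / past


print(relative_changes([1.0, 2.0, 4.0, 8.0], position=3))  # [1. 3. 7.]
```

If this is indeed the whole observation, the policy never sees where the agent value currently sits relative to the trajectory or the bounds.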


@mannyv I am using version 1.8.0. With max_seq_len=50 the situation does not improve; with max_seq_len=200 I get the following error:

2021-11-11 22:40:32,046	ERROR -- Trial PPO_TrajectoryTrackingEnv_f8d8b_00000: Error processing event.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/ray/tune/", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/opt/conda/lib/python3.9/site-packages/ray/tune/", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/opt/conda/lib/python3.9/site-packages/ray/_private/", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/ray/", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::PPO.train() (pid=4077, ip=, repr=PPO)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/ppo/", line 45, in ppo_surrogate_loss
    logits, state = model(train_batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/", line 243, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/torch/", line 187, in forward
    wrapped_out, _ = self._wrapped_forward(input_dict, [], None)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/torch/", line 123, in forward
    self._last_flat_in = obs.reshape(obs.shape[0], -1)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

The above exception was the direct cause of the following exception:

ray::PPO.train() (pid=4077, ip=, repr=PPO)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/", line 682, in train
    raise e
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/", line 668, in train
    result = Trainable.train(self)
  File "/opt/conda/lib/python3.9/site-packages/ray/tune/", line 283, in train
    result = self.step()
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/agents/", line 206, in step
    step_results = next(self.train_exec_impl)
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 756, in __next__
    return next(self.built_iterator)
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.9/site-packages/ray/util/", line 791, in apply_foreach
    result = fn(item)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/execution/", line 197, in __call__
    results = policy.learn_on_loaded_batch(
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/", line 607, in learn_on_loaded_batch
    return self.learn_on_batch(batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/utils/", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/", line 507, in learn_on_batch
    grads, fetches = self.compute_gradients(postprocessed_batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/", line 336, in compute_gradients
    return parent_cls.compute_gradients(self, batch)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/utils/", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/", line 678, in compute_gradients
    tower_outputs = self._multi_gpu_parallel_grad_calc(
  File "/opt/conda/lib/python3.9/site-packages/ray/rllib/policy/", line 1052, in _multi_gpu_parallel_grad_calc
    raise last_result[0] from last_result[1]
ValueError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
In tower 0 on device cpu

Exactly, @arturn!
In the following example:

I am considering two agents that know the future values of the trajectory:

The first earns much more reward than the second. Both know when the trajectory is going to change. The second uses only two actions [0,2], while the first uses all three actions [0,1,2].
The following is an example of the results:



Did you see the issue I linked? I think you are experiencing it.

Thanks @mannyv. I had a quick look, but I need more time to dive deeper into it. Does this affect only 1.7.0 or all versions?


You can avoid the issue even in the current version by doing two things:

  1. Set config["simple_optimizer"] = True
  2. Set sgd_minibatch_size > max_seq_len
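
As a rough illustration of those two settings, here is a sketch of a PPO config dict. The keys are standard RLlib config keys; the concrete values (and the choice of the torch framework, taken from the traceback above) are only assumptions, not the configuration used in this thread.

```python
# Illustrative values only; tune them for your own experiment.
max_seq_len = 50

config = {
    "framework": "torch",
    # 1. Use the simple optimizer path instead of the multi-GPU one.
    "simple_optimizer": True,
    # 2. Keep the SGD minibatch larger than the RNN sequence length.
    "sgd_minibatch_size": 128,
    "model": {
        "use_lstm": True,
        "max_seq_len": max_seq_len,
    },
}

# Guard against accidentally violating the second condition.
assert config["sgd_minibatch_size"] > config["model"]["max_seq_len"]
```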

Hi @mannyv, I have tested with both 1.8.0 and 2.0.0.dev0 using the following configuration:

and the hint you gave me does not solve the problem. I see a similar behaviour:

It looks like it now uses action 1 (stay), but not at the moments when it would be best to use it.
I am training for 100 epochs; do you think training longer would help?

I just trained for 250 epochs and it looks better:



That is looking a lot better. Still some room for improvement though. How do the rewards compare to your hardcoded strategies with future knowledge?

Not sure yet, @mannyv. I am trying to analyze the results.

I think the results are not so bad. One more question, @mannyv, regarding inference. Is this part correct?

Or should it be more similar to this?
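
For reference, a hedged sketch of how inference with a recurrent RLlib policy is commonly written: the LSTM state is initialized once per episode and fed back in at every step. It assumes a trained trainer object with `get_policy()` and `compute_single_action()` (older code uses `compute_action`), an LSTM model, and a Gym environment with the classic four-value `step()` return. The function name `run_episode` is hypothetical.

```python
def run_episode(trainer, env):
    """Roll out one episode, carrying the recurrent state between steps."""
    # Initial LSTM state; assumed non-empty, i.e. the model is recurrent.
    state = trainer.get_policy().get_initial_state()
    obs = env.reset()
    done = False
    total_reward = 0.0
    actions = []
    while not done:
        # With a state passed in, the call returns (action, state_out, extra).
        action, state, _ = trainer.compute_single_action(
            obs, state=state, explore=False
        )
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        actions.append(action)
    return actions, total_reward
```

Resetting or dropping the state between steps makes the LSTM behave as if every step were the start of an episode, which is a common cause of inference results that differ from training.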