External On-Policy Actions in PPO

Hello everyone,

my colleagues and I are working on reinforcement learning controllers on
external hardware. Our starting point is a single RLlib agent operating in an
ExternalEnv on a workstation (WS). We extract its neural network and deploy it
on one or more external devices in order to gather on-policy experiences. Once
enough experiences have been collected, they are all sent back to the
workstation at once and used there for training. The updated policy network is
then transmitted to the external devices again. This cycle is repeated until
the agent meets certain performance criteria. Both the observation and action
space are continuous and the policy is trained with PPO. The aforementioned
training devices are supposed to operate independently and to keep
communication with the workstation to a minimum.
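The collect-train-deploy cycle above can be sketched in plain Python. Note that all names here (Device, ppo_update, meets_criteria, run_training_cycle) are illustrative stubs we made up for the sketch, not RLlib APIs:

```python
class Device:
    """Stub standing in for one external training device."""
    def __init__(self):
        self.weights = None

    def load_weights(self, w):
        # receives the current policy network from the workstation
        self.weights = w

    def collect_experiences(self):
        # each device logs (observation, action, reward) tuples locally
        return [((0.0,), (0.1,), 1.0)]


def ppo_update(weights, batch):
    # placeholder: a real implementation would run a PPO update on the batch
    return weights + 1


def meets_criteria(weights):
    # placeholder for the performance criteria that end training
    return weights >= 3


def run_training_cycle(weights, devices, max_iters=100):
    """Repeat: push weights to all devices, gather batches, train once."""
    for _ in range(max_iters):
        for dev in devices:
            dev.load_weights(weights)            # deploy current policy
        batch = [e for d in devices for e in d.collect_experiences()]
        weights = ppo_update(weights, batch)     # single PPO update on the WS
        if meets_criteria(weights):
            break
    return weights
```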

We’re running Ray 0.8.0 on the WS, though, as far as I can tell, the issue I’m
about to describe persists in the latest version. The policy is implemented
as a fully-connected, eager feed-forward TensorFlow network which is restored
and executed on the external devices using TensorFlow Lite Micro. There, we
interpret the model output as it’s done in DiagGaussian: The first half of
the output represents the means of the action distributions, the second the
natural logarithm of their standard deviations. The external devices log the
input of the neural network, the sampled actions according to the action
distributions (i.e. the chosen action), and the corresponding reward for every
training step.
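That output convention can be sketched as a minimal NumPy version of what DiagGaussian does (sample_action is our own helper name, not part of RLlib or TF Lite Micro):

```python
import numpy as np

def sample_action(model_output, rng=None):
    """Interpret a flat policy output the way RLlib's DiagGaussian does:
    the first half holds the means, the second half the natural logarithms
    of the standard deviations of a diagonal Gaussian distribution."""
    if rng is None:
        rng = np.random.default_rng()
    half = len(model_output) // 2
    mean = np.asarray(model_output[:half], dtype=np.float64)
    log_std = np.asarray(model_output[half:], dtype=np.float64)
    # reparameterised sample: mean + std * standard normal noise
    return mean + np.exp(log_std) * rng.standard_normal(half)

# e.g. a 2-D action space: means (0.5, -0.2), both log stds = -2.0
action = sample_action([0.5, -0.2, -2.0, -2.0])
```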

The challenging part is making the agent train with the generated data. PPO is
an on-policy algorithm and the way the external devices sample actions from an
exact copy of the workstation’s neural network means that they are, in fact,
always on-policy. ExternalEnv’s log_action() method treats its argument as an
off-policy action, so that doesn’t work. Using offline datasets isn’t an
option either since they require logging additional information to work
properly. Given that our actions are always on-policy, the extra data
would be redundant and a waste of memory on the training devices.

Because of all that, we decided to patch RLlib such that actions logged with
ExternalEnv’s log_action() are treated as if they were on-policy (we dubbed
them “remote on-policy actions”). You can find the changes here[1]. The
patch assumes that the ExternalEnv processes episodes sequentially (i.e. no
concurrent episodes) and that it treats them like they were generated by a
single agent/device, regardless of origin. Our solution is a bit hacky, but
experiences should now be processed like this:

  1. Inside ExternalEnv’s run() method, experiences are logged using
    start_episode(), log_action(), log_returns(), and end_episode(). This writes
    the observation, reward, and remote on-policy action into the queue of an
    _ExternalEnvEpisode object.
  2. They are fetched by _ExternalEnvToBaseEnv.poll() when it’s called in
    _env_runner(). The remote on-policy actions are passed to
    _do_policy_eval() as a keyword argument which, in turn, passes them on to
    eager_policy_cls.compute_actions().
  3. eager_policy_cls.compute_actions() checks whether remote on-policy
    actions were passed. If so, the chosen action is set to the remote
    on-policy action; if not, an action is sampled. The logp is calculated
    explicitly in either case.
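The behaviour in step 3 can be sketched roughly like this (plain NumPy, not the actual eager-policy code; diag_gaussian_logp mirrors the standard diagonal-Gaussian log-density that DiagGaussian.logp() implements):

```python
import numpy as np

def diag_gaussian_logp(action, mean, log_std):
    """Log-density of a diagonal Gaussian distribution."""
    std = np.exp(log_std)
    return float(
        -0.5 * np.sum(((action - mean) / std) ** 2)
        - np.sum(log_std)
        - 0.5 * len(mean) * np.log(2.0 * np.pi)
    )

def compute_actions(model_output, remote_on_policy_action=None, rng=None):
    """If a remote on-policy action was logged, adopt it instead of
    sampling; compute the logp explicitly in either case."""
    if rng is None:
        rng = np.random.default_rng()
    half = len(model_output) // 2
    mean = np.asarray(model_output[:half], dtype=np.float64)
    log_std = np.asarray(model_output[half:], dtype=np.float64)
    if remote_on_policy_action is not None:
        # treat the externally chosen action as if it were sampled here
        action = np.asarray(remote_on_policy_action, dtype=np.float64)
    else:
        action = mean + np.exp(log_std) * rng.standard_normal(half)
    return action, diag_gaussian_logp(action, mean, log_std)
```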

As it is right now, the changes are specific to our use-case and are likely to
break the code in other scenarios.

Experiments have shown that the code works as intended and results look good.
Nevertheless, I have to admit that tracing the flow of information inside
RLlib wasn’t easy, so we’d appreciate feedback from the community: Did we miss
any important steps or are there side effects we didn’t consider? Also, if you
think the concept of remote on-policy actions is worth integrating into the
official repository, we’d be glad to contribute a more elaborate version of the patch.


[1] GitHub - kevbad/ray at remote_on_policy_actions

Hey Kevin, thanks for posting this interesting question. You are right in that for the on-policy case, one should never call log_action on the client side.

In my understanding, there are two ways of running on-policy algos via the ExternalEnv API:

  • inference_mode=local (set this inside the PolicyClient’s constructor): In this mode, your local policy client will own its own policy, which it uses for action computation (every time you call client.get_action(), it’ll query that local policy for an action). Then - after enough samples have been collected locally - a batch (including obs, rewards, actions, action-prob, etc.) is sent to the server for learning (there is no wasted data here; everything sent to the server is needed by the server for proper learning!).
    The problem here is that the server can do a proper on-policy update using the sent samples, BUT the updated weights are only sent back to the client sporadically (every n seconds). ← this is currently a problem imho in RLlib’s external env API for on-policy algos that we need to fix!

What I’m understanding from your description is that you chose the other inference mode: “remote”, correct?

  • inference_mode=“remote”: Here, for each action, a query is sent to the server (via network) to calculate the action on the server and send it back to the client (same call: client.get_action()). In this mode, there shouldn’t be any problems with policies becoming outdated as there are no policy copies located on the client side.
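The difference between the two modes can be illustrated with a toy sketch (not Ray code; ToyServer and ToyLocalClient are made-up stand-ins):

```python
class ToyServer:
    """Toy stand-in for the policy server; tracks a policy 'version'."""
    def __init__(self):
        self.version = 0

    def train(self):
        self.version += 1        # pretend one on-policy update happened

    def compute_action(self, obs):
        return self.version      # remote mode: always the latest policy


class ToyLocalClient:
    """Local inference: owns a policy copy, synced only sporadically."""
    def __init__(self, server):
        self.server = server
        self.version = server.version

    def sync(self):
        # in RLlib this happens every n seconds, not after every update
        self.version = self.server.version

    def get_action(self, obs):
        return self.version      # may lag behind the server
```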

Hi Sven,

the external training devices don’t run Ray/RLlib themselves, nor do they know
that it’s being used on the workstation. Instead, they receive a TensorFlow
neural network (each device gets its own copy), utilise it to choose actions in
order to interact with their environments, log experiences, and send them back
to the WS once a sufficient number has been gathered. Please note that said
environments are entirely different pieces of software that have nothing to do
with RLlib’s classes.

There hasn’t been any activity in this thread for a while now, so I figured that
condensing the initial post could help people understand the core problem.

The question is whether one can force-feed RLlib’s PPO implementation with
(observation, action, reward) tuples — provided that those are genuinely
on-policy — and have the algorithm train with that information alone. If not,
what data is internally generated when, for example, running an ExternalEnv in
remote inference mode that one would be missing?
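For reference, PPO’s loss also consumes per-step action log-probabilities and value-function predictions, from which advantages and value targets are derived via GAE during postprocessing. A standard GAE sketch (not the exact RLlib implementation) looks like this:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode fragment.
    `values` are the value-net predictions per step, `last_value` the
    bootstrap value of the state after the final step."""
    values = np.append(np.asarray(values, dtype=np.float64), last_value)
    rewards = np.asarray(rewards, dtype=np.float64)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    value_targets = advantages + values[:-1]
    return advantages, value_targets
```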

The following link takes you right to the changes we made in order to pass
on-policy actions through ExternalEnv's log_action() method: