Hello everyone,
my colleagues and I are working on reinforcement learning controllers on
external hardware. Our starting point is a single RLlib agent operating in an
ExternalEnv on a workstation (WS). We extract its neural network and deploy it
on one or more external devices in order to gather on-policy experiences. Once
enough experiences have been collected, they are sent back to the workstation
in a single batch and used there for training. The updated policy network is
then transmitted to the external devices again. This cycle is repeated until
the agent meets certain performance criteria. Both the observation and action
space are continuous and the policy is trained with PPO. The aforementioned
training devices are supposed to operate independently and to keep
communication with the workstation to a minimum.
We’re running Ray 0.8.0 on the WS, though, as far as I can tell, the issue I’m
about to describe persists in the latest version. The policy is implemented
as a fully-connected, eager feed-forward TensorFlow network which is restored
and executed on the external devices using TensorFlow Lite Micro. There, we
interpret the model output the same way RLlib’s DiagGaussian does: the first
half of the output holds the means of the action distribution, the second half
the natural logarithms of the standard deviations. For every training step,
the external devices log the input of the neural network, the action sampled
from that distribution (i.e. the chosen action), and the corresponding reward.
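To make that interpretation concrete, here is a minimal sketch of the
on-device sampling step (plain NumPy stands in for the actual TFLite Micro
inference code; all names are illustrative):

    import numpy as np

    def sample_action(flat_output):
        """Interpret a DiagGaussian-style output vector: the first half
        holds the means, the second half the log standard deviations."""
        act_dim = flat_output.shape[-1] // 2
        mean = flat_output[:act_dim]
        log_std = flat_output[act_dim:]
        # Sample the chosen action from the diagonal Gaussian; this is
        # what gets logged together with the observation and the reward.
        return mean + np.exp(log_std) * np.random.standard_normal(act_dim)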
The challenging part is making the agent train with the generated data. PPO is
an on-policy algorithm and the way the external devices sample actions from an
exact copy of the workstation’s neural network means that they are, in fact,
always on-policy. ExternalEnv’s log_action() method treats its argument as an
off-policy action, so that doesn’t work. Using offline datasets isn’t an
option either since they require logging additional information to work
properly. Given that our actions are always on-policy, the extra data
would be redundant and a waste of memory on the training devices.
Because of all that, we decided to patch RLlib such that actions logged with
ExternalEnv’s log_action() are treated as if they were on-policy (we dubbed
them “remote on-policy actions”). You can find the changes here[1]. The
patch assumes that the ExternalEnv processes episodes sequentially (i.e. no
concurrent episodes) and that it treats them as if they were generated by a
single agent/device, regardless of their origin. Our solution is a bit hacky, but
experiences should now be processed like this:
- Inside ExternalEnv’s run() method, experiences are logged using
start_episode(), log_action(), log_returns(), and end_episode(). This writes
the observation, reward, and remote on-policy action into the queue of an
_ExternalEnvEpisode object (a sketch of such a run() loop follows right after
this list).
- They are fetched by _ExternalEnvToBaseEnv.poll() when it is called in
_env_runner(). The remote on-policy actions are passed to _do_policy_eval()
as a keyword argument which, in turn, passes them on to
policy.compute_actions().
- eager_policy_cls.compute_actions() checks whether remote on-policy actions
were passed. If so, the chosen action is set to the remote on-policy action;
if not, an action is sampled as usual. The logp is calculated explicitly in
either case (the underlying formula is sketched further below).
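For illustration, a stripped-down run() loop that replays the device logs
could look roughly like this (get_device_transitions() is a hypothetical
placeholder for however the logged data arrives from a device; the rest uses
the standard ExternalEnv API):

    from ray.rllib.env import ExternalEnv

    class DeviceLogEnv(ExternalEnv):
        """Feeds transitions recorded on the external devices into RLlib."""

        def __init__(self, action_space, observation_space):
            ExternalEnv.__init__(self, action_space, observation_space)

        def run(self):
            while True:
                # One logged episode from a device: a list of
                # (observation, action, reward) tuples plus the final
                # observation. Hypothetical helper, not part of RLlib.
                transitions, last_obs = get_device_transitions()
                episode_id = self.start_episode()
                for obs, action, reward in transitions:
                    # With the patch, this action is treated as a remote
                    # on-policy action rather than an off-policy one.
                    self.log_action(episode_id, obs, action)
                    self.log_returns(episode_id, reward)
                self.end_episode(episode_id, last_obs)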
As it stands, the changes are specific to our use case and are likely to
break the code in other scenarios.
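For reference, the explicit logp calculation in the diagonal-Gaussian case
boils down to the standard log-density formula (a standalone sketch of the
math, not the patched RLlib code itself):

    import numpy as np

    def diag_gaussian_logp(action, mean, log_std):
        """Log-density of a diagonal Gaussian at `action`, given the
        network's means and log standard deviations."""
        std = np.exp(log_std)
        return (-0.5 * np.sum(np.square((action - mean) / std))
                - np.sum(log_std)
                - 0.5 * action.size * np.log(2.0 * np.pi))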
Experiments have shown that the code works as intended and results look good.
Nevertheless, I have to admit that tracing the flow of information inside
RLlib wasn’t easy, so we’d appreciate feedback from the community: Did we miss
any important steps or are there side effects we didn’t consider? Also, if you
think the concept of remote on-policy actions is worth integrating into the
official repository, we’d be glad to contribute a more elaborate version of the
above.
Yours,
Kevin
References:
[1] GitHub - kevbad/ray at remote_on_policy_actions