[RLlib] Make it easier to play trained policies

I’m surprised no one else has complained about this, but it is nearly impossible to try out a trained policy (or at least I can’t find ANY docs about it).

Basically, training is easy, but stepping through the environment and feeding it an action selected by the policy is not documented anywhere.

Here is an issue I filed about this:

Hi @drozzy, I feel your pain. I came from stable_baselines too. I just wrote a runnable script to try out a trained policy below. It’s for multi-agent but can easily be modified for single agent. I think the docs have an example for single agent, but I can’t remember where at the moment. Cheers,

Edit: there is definitely a steeper learning curve for RLlib than for stable_baselines, imho. For my research work, I wish I had started with RLlib rather than stable_baselines.

Edit 2: the single agent version is here: RLlib Training APIs — Ray v2.0.0.dev0
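
For readers who only want the shape of that single-agent loop, here is a minimal sketch. It is not copied from the linked docs. `CoinFlipEnv` and `play_episode` are hypothetical names invented for illustration, and the environment is a toy stand-in that uses the classic Gym `reset()`/`step()` interface so the snippet runs without Ray or Gym installed. With RLlib you would presumably restore a trainer from a checkpoint and pass its action-computing method (`trainer.compute_action` in Ray 1.x, as far as I know; check the name for your version) as `compute_action`.

```python
import random
from typing import Any, Callable, Tuple


class CoinFlipEnv:
    """Toy stand-in env (hypothetical) with the classic Gym reset()/step() interface."""

    def __init__(self, episode_length: int = 5, seed: int = 0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self) -> int:
        # Start a new episode and return the first observation (the coin face).
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action: int) -> Tuple[int, float, bool, dict]:
        # Reward 1.0 when the action matches the coin that was just observed.
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.episode_length
        return self.coin, reward, done, {}


def play_episode(env: Any, compute_action: Callable[[Any], Any]) -> float:
    """Run one episode, asking `compute_action` for an action at every step."""
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        # Assumption: with RLlib this callable would be the restored
        # trainer's action method instead of a hand-written function.
        action = compute_action(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    # Stand-in "policy" that simply echoes the observation as its action.
    print(play_episode(CoinFlipEnv(), lambda obs: obs))
```

Keeping the loop independent of where the action comes from makes it easy to swap a random policy, a scripted one, or a trained one in and out while debugging the environment.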

Hey @drozzy, great point, and thanks for all your help on this @stefanbschneider and @RickLan.
Yes, we should document this better.

For the LSTM and attention cases, you can also take a look at these example scripts, where the env loops are described in the comments:

ray.rllib.examples.attention_net.py and ray.rllib.examples.cartpole_lstm.py.
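
The main difference in the recurrent case is that the loop has to carry the policy's internal state from one step to the next. Below is a minimal sketch of that pattern only. It is not taken from the example scripts above. `CountdownEnv`, `counting_policy`, and `play_recurrent_episode` are hypothetical names, and the "policy" is a hand-written stand-in so the snippet runs without Ray. In RLlib the initial state would presumably come from the policy itself and the action call would accept and return the state; treat that as an assumption and check the example scripts for the exact calls.

```python
from typing import Any, Callable, List, Tuple


class CountdownEnv:
    """Toy stand-in env (hypothetical): the observation is always 0, so a
    policy needs memory to know which step of the episode it is on."""

    def __init__(self, length: int = 4):
        self.length = length
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return 0

    def step(self, action: int) -> Tuple[int, float, bool, dict]:
        # Reward 1.0 when the action equals the current step index.
        reward = 1.0 if action == self.t else 0.0
        self.t += 1
        return 0, reward, self.t >= self.length, {}


def counting_policy(obs: Any, state: List[int]) -> Tuple[int, List[int]]:
    """Hand-written stand-in for a recurrent policy: the state is a step counter."""
    (count,) = state
    return count, [count + 1]


def play_recurrent_episode(
    env: Any,
    compute_action: Callable[[Any, List[Any]], Tuple[Any, List[Any]]],
    initial_state: List[Any],
) -> float:
    """Run one episode, threading the recurrent state through every step."""
    obs = env.reset()
    state = list(initial_state)  # reset the state at the start of each episode
    done = False
    total_reward = 0.0
    while not done:
        # The policy returns both the action and the updated state.
        action, state = compute_action(obs, state)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    print(play_recurrent_episode(CountdownEnv(), counting_policy, [0]))
```

Forgetting to reset the state at the start of each episode, or dropping the returned state inside the loop, is the usual way this pattern goes wrong.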