drozzy
April 1, 2021, 4:18am
I’m surprised no one else complains about this… but it is nearly impossible to try out a trained policy (or at least I can’t find ANY docs about this).
Basically, training is easy, but stepping through the environment and feeding it an action selected by the policy is not documented anywhere.
Here is an issue I filed about this:
[GitHub issue: opened 02 Mar 21 UTC, closed 19 May 21 UTC; labels: enhancement, rllib]
I know this might be a duplicate, but there is still no clear section in the docs explaining how to do a simple rollout/render from a trained policy.
What I'm talking about is **code-based** (not command line based) equivalent of [stable baselines example like this](https://stable-baselines.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading):
<img width="647" alt="Screen Shot 2021-03-01 at 7 20 19 PM" src="https://user-images.githubusercontent.com/140710/109577178-2ac99f00-7ac3-11eb-8408-a878ec8a1143.png">
In contrast, RLlib provides only [this huge file that is really hard to understand](https://github.com/ray-project/ray/blob/master/rllib/rollout.py#L25), which can only be used via the CLI:
<img width="549" alt="Screen Shot 2021-03-01 at 7 21 52 PM" src="https://user-images.githubusercontent.com/140710/109577305-62d0e200-7ac3-11eb-9ca6-c078c2eae2d4.png">
It would be nice to have an example of how to access a trained policy itself, so that we can write a simple render loop like in stable-baselines above. 
Hi @drozzy, I feel your pain. I came from stable_baselines too. I just wrote a runnable script to try out a trained policy below. It’s for multi-agent but can easily be modified for single-agent. I think the docs have an example for single agent, but I can’t remember where atm. Cheers,
  
  
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
from ray.rllib.examples.env.random_env import RandomMultiAgentEnv

num_agents = 2

config = ppo.DEFAULT_CONFIG.copy()
config["num_workers"] = 1
config["env_config"] = {
    "num_agents": num_agents,
}

env = RandomMultiAgentEnv(config["env_config"])

config["multiagent"] = {
    "policies": {  # (policy_cls, obs_space, act_space, config)
        "{}".format(x): (None, env.observation_space, env.action_space, {}) for …
Edit: there is definitely a higher learning curve for RLlib than for stable_baselines, imho. For my research work, I wish I had started with RLlib rather than stable_baselines.
Edit 2: the single agent version is here: Getting Started with RLlib — Ray 3.0.0.dev0 
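Edit 3: a minimal single-agent sketch of the same idea (my assumptions: CartPole-v0, the classic Gym step API, and trainer.compute_action(); exact method names may differ between RLlib versions):

```python
import gym
import ray
import ray.rllib.agents.ppo as ppo

ray.init()

# Train PPO on CartPole for a few iterations.
config = ppo.DEFAULT_CONFIG.copy()
config["num_workers"] = 1
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
for _ in range(3):
    print(trainer.train()["episode_reward_mean"])

# Roll out the trained policy in a locally created env.
env = gym.make("CartPole-v0")
obs = env.reset()
done = False
episode_reward = 0.0
while not done:
    action = trainer.compute_action(obs)  # ask the trained policy for an action
    obs, reward, done, info = env.step(action)
    episode_reward += reward
    env.render()
print("episode reward:", episode_reward)
```

The same loop works for any registered Gym env; only the env name and the trainer config need to change.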
Hey @drozzy, great point, and thanks for all your help on this @stefanbschneider and @RickLan.
For the LSTM and attention cases, you can also take a look at these example scripts, where the env loops are described in the comments:
ray.rllib.examples.attention_net.py and ray.rllib.examples.cartpole_lstm.py.
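For the recurrent case, the main difference is that the RNN state has to be carried through the loop by hand. A rough sketch, assuming use_lstm=True and that compute_action returns (action, state_out, info) when a state is passed in (this detail may vary between versions):

```python
import gym
import ray
import ray.rllib.agents.ppo as ppo

ray.init()

# Partial config; RLlib merges it with the PPO defaults.
config = {
    "num_workers": 1,
    "model": {"use_lstm": True},  # wrap the default net in an LSTM
}
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
for _ in range(3):
    trainer.train()

env = gym.make("CartPole-v0")
obs = env.reset()
done = False
# Start from the policy's initial (zero) RNN state.
state = trainer.get_policy().get_initial_state()
while not done:
    # With a recurrent model, compute_action returns (action, state_out, info)
    # when a state is passed in; feed state_out back in on the next step.
    action, state, _ = trainer.compute_action(obs, state=state)
    obs, reward, done, info = env.step(action)
```

The attention case is similar in spirit but handles its memory tensors a bit differently; see attention_net.py for the exact loop.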