Trying to understand model and env concepts

Luca_Pretini · September 24, 2021, 2:38pm

Hi,
i’ve been trying to understand and use RLlib in the past weeks, but I can’t understand some basic concepts of the framework.
My goal is to have a multiagent environment, and to share some informations about state and actions of other agents.
The main examples i’m looking at are centralized_critic2.py example and trajectory_view_api.py.
In both examples there is some data that i do not understand how is it passed between the model and the env.
I’m talking for example about previous rewards/obs/actions in the trajectory example model ray/trajectory_view_utilizing_models.py at 62f5da0b65dab0174d5ee0276a693df0ca2b958c · ray-project/ray · GitHub
and also about the own_obs/opponent_obs in the centralized model ray/centralized_critic_models.py at 62f5da0b65dab0174d5ee0276a693df0ca2b958c · ray-project/ray · GitHub.

I expected those observations to be referenced by the environments in the examples, but i see nothing. Am I missing something?

gjoliver · September 24, 2021, 7:44pm

For the trajectory view example:
Trajectory view is not passed between model and env. Trajectory view is more like a specification that you specify with a model.

Think about it this way. When you interact with your env, you only get the latest state or reward out, that is your observation and reward.
But for RNN or attention models, what you feed into the NN is a list of observations and rewards, usually those from the last n steps.

So somewhere in RLlib, we have to save the data from previous steps. And we need to know how much data in history we need to keep, and how to stack them.
That is basically ViewRequirement.

The few lines you pointed out, we are creating ViewRequirement for the last num_frames steps. And when we collect an episode from your Env, although each step only has 1 obs and 1 reward, we know that we need to stack things properly for your Model to be trained.

For example, in recurrent_net.py, you see sometimes we would stack previous things together before running the forward() pass.

github.com

ray-project/ray/blob/master/rllib/models/tf/recurrent_net.py#L206-L219

    
      
          if self.model_config["lstm_use_prev_action"]:
              prev_a = input_dict[SampleBatch.PREV_ACTIONS]
              if isinstance(self.action_space, (Discrete, MultiDiscrete)):
                  prev_a = one_hot(prev_a, self.action_space)
              prev_a_r.append(
                  tf.reshape(tf.cast(prev_a, tf.float32), [-1, self.action_dim]))
          if self.model_config["lstm_use_prev_reward"]:
              prev_a_r.append(
                  tf.reshape(
                      tf.cast(input_dict[SampleBatch.PREV_REWARDS], tf.float32),
                      [-1, 1]))
          
          
if prev_a_r:
              wrapped_out = tf.concat([wrapped_out] + prev_a_r, axis=1)

For the multi-agent example, you are right, I think the model assumes that your Env will return a dict of obs. One entry for your agent, and one entry for the opponent agent.

github.com

ray-project/ray/blob/62f5da0b65dab0174d5ee0276a693df0ca2b958c/rllib/examples/models/centralized_critic_models.py#L59

    
      
                      ]), [-1])
          
          
    @override(ModelV2)
              def value_function(self):
                  return self.model.value_function()  # not used
          
          

          
class YetAnotherCentralizedCriticModel(TFModelV2):
              """Multi-agent model that implements a centralized value function.
          
          
    It assumes the observation is a dict with 'own_obs' and 'opponent_obs', the
              former of which can be used for computing actions (i.e., decentralized
              execution), and the latter for optimization (i.e., centralized learning).
          
          
    This model has two parts:
              - An action model that looks at just 'own_obs' to compute actions
              - A value model that also looks at the 'opponent_obs' / 'opponent_action'
                to compute the value (it does this by using the 'obs_flat' tensor).
              """
          
          
    def __init__(self, obs_space, action_space, num_outputs, model_config,

And you can do whatever you want with them, for example, concatenate and feed them into a centralized value function, so your agent will be aware of the global situation. Or only feed obs into your policy, so it can only make decisions based on its own situation, etc.

Luca_Pretini · September 25, 2021, 3:33pm

Thanks for the answer!
Guess i have to understand better what are the roles of model and env then.
Let’s take the cartpole/trajectory example, the models here

github.com

ray-project/ray/blob/65fa740c3b8f27803715521524c4d2245f3b0517/rllib/examples/models/trajectory_view_utilizing_models.py

from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.misc import SlimFC
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.view_requirement import ViewRequirement
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.utils.tf_ops import one_hot
from ray.rllib.utils.torch_ops import one_hot as torch_one_hot

tf1, tf, tfv = try_import_tf()
torch, nn = try_import_torch()

# __sphinx_doc_begin__


class FrameStackingCartPoleModel(TFModelV2):
    """A simple FC model that takes the last n observations as input."""

    def __init__(self,
                 obs_space,
                 action_space,

This file has been truncated. show original

are building an FC model and using view_requirements/input dict to have access the the previous observations.

What’s the role of the env then?
Especially of the step function here:

github.com

ray-project/ray/blob/65fa740c3b8f27803715521524c4d2245f3b0517/rllib/examples/env/stateless_cartpole.py#L30

    
      
              # Fix our observation-space (remove 2 velocity components).
              high = np.array(
                  [
                      self.x_threshold * 2,
                      self.theta_threshold_radians * 2,
                  ],
                  dtype=np.float32)
          
          
    self.observation_space = Box(low=-high, high=high, dtype=np.float32)
          
          
def step(self, action):
              next_obs, reward, done, info = super().step(action)
              # next_obs is [x-pos, x-veloc, angle, angle-veloc]
              return np.array([next_obs[0], next_obs[2]]), reward, done, info
          
          
def reset(self):
              init_obs = super().reset()
              # init_obs is [x-pos, x-veloc, angle, angle-veloc]
              return np.array([init_obs[0], init_obs[2]])

stefanbschneider · September 28, 2021, 4:17pm

Here’s what I understand; hopefully that helps clear things up:

The frame-stacking model defines the model, i.e., what the used NN looks like (some dense layers) and how inputs are passed to the model to make predictions in the forward function. Here, it just says that the model not just expects a single observation, action, and reward but multiple (num_frames) many of these experiences concatenated. The model just defines the shape of expected inputs and outputs (+ NN structure); it does not define what observations, actions, rewards exactly look like.
The exact observations, actions, rewards depend on the environment, which represents the world in which the RL agent(s) live and act. Environments typically follow the OpenAI Gym interface, where step() is a core function of the environment implementing the environment dynamics. Specifically, the function takes an action from an RL agent(s) and defines how the environment changes with that action. As a result, it returns one observation and reward (per agent).
In the stateless cartpole environment, step() basically calls the step() function of the normal, parent cartpole environment but only returns reduced observations that do not include velocity but just the current position of the cartpole (i.e., stateless).
With just the position of the cartpole, the RL agent has no chance of knowing whether the pole is swinging up, down or is rather stable - but this makes a huge difference. In this and in similar environments, where some state is missing but important for decisions, the history of previous experiences is important (here, the previous positions of the pole). This is where the frame-stacked model comes into play and automatically keeps the last num_frames experiences and passes them to the NN (instead of just the last position).

Does this help?

Luca_Pretini · September 29, 2021, 8:45am

Thanks, that was helpful!
In the meantime i actually find out the OpenAI Gym documentation about envs, that indeed helped to understand better the role of the env and what it does/defines.

I still miss a quite important point in the overall architecture:
From my actual understanding the Env defines HOW ( space of actions and observations, rewards, done, etc) and Model manages layers of the NN i/o. (more or less)
Let’s suppose i want do define something like:
If there’s an X observation, do action Y.
Where is a piece of code like that supposed to be?

stefanbschneider · September 29, 2021, 9:01am

Great, I’m happy to help

Let’s suppose i want do define something like:
If there’s an X observation, do action Y.
Where is a piece of code like that supposed to be?

That’s what you agent does. You could write your own agent, that follows some rules/algorithm to decide action Y should happen after observation X. Or you could build an RL agent (with RLlib) that learns itself which action to choose after observation X (using compute_action(observation); or compute_single_action(observation)).
Either way, deciding which action to pick is the logic that goes into the agent, which interacts with the environment: Env provides an observation and reward, agent picks an action, env applies action and provides next obs and reward - and so on.

Luca_Pretini · September 29, 2021, 9:21am

Agent is actually an entity that i have quite clear on theoric RL, but I can’t find documentation or examples about it on RLlib.
From my understanding what is called in theory “agent” is kinda overlapped by the Trainer in RLlib, am I right?

mannyv · September 29, 2021, 10:34am

Hi @Luca_Pretini,

In rllib the policy class is what defines the actions taken given an observation.

https://docs.ray.io/en/latest/rllib-package-ref.html?highlight=Policy#ray.rllib.policy.Policy

The trainer manages the entire process. It takes the environment and all the policies and specifies their interactions. How often and how many new samples are obtained, where they are stored, how they are used to update the policy, etc.

Topic		Replies	Views
Potential bug in trajectory view API for multiagent envs RLlib	6	714	February 12, 2021
Trajectory View API Example RLlib	4	393	February 8, 2023
States of Recurrent models for multiple workers/envs RLlib	1	302	April 14, 2021
Save played trajectories in memory RLlib	1	423	August 17, 2022
Saving episode trajectories during training RLlib	0	219	July 13, 2023

Trying to understand model and env concepts

Related topics