Trying to understand model and env concepts

I’ve been trying to understand and use RLlib over the past weeks, but I can’t grasp some basic concepts of the framework.
My goal is to have a multi-agent environment and to share some information about the state and actions of other agents.
The main examples I’m looking at are the trajectory view example and the centralized critic example.
In both examples there is some data, and I do not understand how it is passed between the model and the env.
I’m talking, for example, about the previous rewards/obs/actions in the trajectory view example model, and also about the own_obs/opponent_obs in the centralized critic model (both in the ray-project/ray repo on GitHub).

I expected those observations to be referenced by the environments in the examples, but I see nothing. Am I missing something?

  1. For the trajectory view example:
    The trajectory view is not passed between model and env. It is more like a specification that you attach to a model.

Think about it this way: when you interact with your env, you only get the latest state and reward out; that is your observation and reward.
But for RNN or attention models, what you feed into the NN is a list of observations and rewards, usually those from the last n steps.

So somewhere in RLlib, we have to save the data from previous steps. And we need to know how much history to keep and how to stack it.
That is basically what ViewRequirement is.

In the few lines you pointed out, we create a ViewRequirement for the last num_frames steps. And when we collect an episode from your env, although each step only has one obs and one reward, we know that we need to stack things properly for your model to be trained.

For example, in the model code you can see that we sometimes stack the previous frames together before running the forward() pass.
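
Conceptually (this is a plain-Python sketch of the mechanism, not RLlib's actual API; the class name is made up), the stacking a ViewRequirement describes could look like this:

```python
from collections import deque

class FrameStacker:
    """Keeps the last `num_frames` observations, the way a
    ViewRequirement over the last n steps tells RLlib to do internally."""

    def __init__(self, num_frames, initial_obs):
        # Pre-fill with the initial observation (zero-padding at episode
        # start is another common choice).
        self.frames = deque([initial_obs] * num_frames, maxlen=num_frames)

    def push(self, obs):
        # The env still emits ONE obs per step ...
        self.frames.append(obs)

    def stacked(self):
        # ... but the model's forward() receives the last n of them.
        return list(self.frames)

stacker = FrameStacker(num_frames=3, initial_obs=0.0)
stacker.push(1.0)
stacker.push(2.0)
print(stacker.stacked())  # -> [0.0, 1.0, 2.0]
```

So the env never sees the stack; the collection machinery builds it between env and model.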

  2. For the multi-agent example, you are right: I think the model assumes that your env will return a dict of obs, with one entry for your agent and one entry for the opponent agent.

And you can do whatever you want with them: for example, concatenate them and feed them into a centralized value function, so your agent will be aware of the global situation; or feed only the agent's own obs into your policy, so it can only make decisions based on its own situation; etc.
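
As a sketch of that idea (agent IDs, keys, and values here are dummies made up for illustration, not the actual example's env), such a step() might return per-agent dicts like:

```python
def step(actions):
    """Toy multi-agent step(): each agent's observation dict bundles
    its own obs with the opponent's, so a centralized value function
    can see both. All values are dummies."""
    raw = {"agent_0": [0.1, 0.2], "agent_1": [0.3, 0.4]}
    obs = {
        "agent_0": {"own_obs": raw["agent_0"], "opponent_obs": raw["agent_1"]},
        "agent_1": {"own_obs": raw["agent_1"], "opponent_obs": raw["agent_0"]},
    }
    rewards = {"agent_0": 1.0, "agent_1": -1.0}
    dones = {"__all__": False}
    return obs, rewards, dones, {}

obs, _, _, _ = step({"agent_0": 0, "agent_1": 1})
# A centralized critic could then concatenate own + opponent obs:
critic_input = obs["agent_0"]["own_obs"] + obs["agent_0"]["opponent_obs"]
```

The policy itself would still only look at `own_obs`; only the value function consumes the concatenated input.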


Thanks for the answer!
I guess I have to better understand the roles of the model and the env then.
Let’s take the cartpole/trajectory example: the models here

are building an FC model and using view_requirements / the input dict to get access to the previous observations.

What’s the role of the env then?
Especially of the step function here:

Here’s what I understand; hopefully that helps clear things up:

  • The frame-stacking model defines the model, i.e., what the NN used looks like (some dense layers) and how inputs are passed to it to make predictions in the forward function. Here, it just says that the model expects not a single observation, action, and reward, but multiple (num_frames many) of these experiences concatenated. The model only defines the shape of the expected inputs and outputs (plus the NN structure); it does not define what the observations, actions, and rewards exactly look like.
  • The exact observations, actions, and rewards depend on the environment, which represents the world in which the RL agent(s) live and act. Environments typically follow the OpenAI Gym interface, where step() is a core function implementing the environment dynamics. Specifically, the function takes the action(s) from the RL agent(s) and defines how the environment changes with that action. As a result, it returns one observation and reward (per agent).
  • In the stateless cartpole environment, step() basically calls the step() function of the normal, parent cartpole environment but returns reduced observations that do not include the velocities, just the current positions of the cartpole (hence "stateless").
  • With just the position of the cartpole, the RL agent has no chance of knowing whether the pole is swinging up, swinging down, or is rather stable - but this makes a huge difference. In this and similar environments, where some state is missing but important for decisions, the history of previous experiences matters (here, the previous positions of the pole). This is where the frame-stacking model comes into play: it automatically keeps the last num_frames experiences and passes them to the NN (instead of just the last position).
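
A minimal sketch of that wrapping idea (plain Python with dummy dynamics, no Gym dependency; the index positions follow CartPole's [x, x_dot, theta, theta_dot] layout):

```python
class ToyCartPole:
    """Stand-in for the full env: obs = [x, x_dot, theta, theta_dot]."""
    def step(self, action):
        full_obs = [0.02, 0.5, 0.03, -0.7]  # dummy dynamics
        return full_obs, 1.0, False, {}

class StatelessToyCartPole(ToyCartPole):
    """Drops the velocity entries, keeping only position and angle --
    the agent can no longer tell which way the pole is moving."""
    def step(self, action):
        full_obs, reward, done, info = super().step(action)
        partial_obs = [full_obs[0], full_obs[2]]  # x and theta only
        return partial_obs, reward, done, info

obs, _, _, _ = StatelessToyCartPole().step(0)
print(obs)  # -> [0.02, 0.03]
```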

Does this help?


Thanks, that was helpful!
In the meantime I actually found the OpenAI Gym documentation about envs, which indeed helped me better understand the role of the env and what it does/defines.

I still miss one quite important point in the overall architecture:
From my current understanding, the Env defines the HOW (spaces of actions and observations, rewards, done, etc.) and the Model manages the layers and I/O of the NN (more or less).
Let’s suppose I want to define something like:
If there’s an X observation, do action Y.
Where is a piece of code like that supposed to be?

Great, I’m happy to help 🙂

Let’s suppose I want to define something like:
If there’s an X observation, do action Y.
Where is a piece of code like that supposed to be?

That’s what your agent does. You could write your own agent that follows some rules/algorithm to decide that action Y should happen after observation X. Or you could build an RL agent (with RLlib) that learns by itself which action to choose after observation X (using compute_action(observation) or compute_single_action(observation)).
Either way, deciding which action to pick is the logic that goes into the agent, which interacts with the environment: the env provides an observation and reward, the agent picks an action, the env applies the action and provides the next obs and reward - and so on.
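
For the rule-based case, here is a minimal sketch of where that if-X-then-Y logic lives and how the interaction loop looks (all classes here are toy stand-ins, not RLlib code):

```python
class RuleBasedAgent:
    """The 'if observation X, do action Y' logic lives in the agent,
    not in the env or the model."""
    def compute_action(self, obs):
        return 1 if obs < 0 else 0  # X: obs is negative -> Y: action 1

class ToyEnv:
    """The env only defines dynamics and rewards, not the decision rule."""
    def reset(self):
        self.state = -2
        return self.state
    def step(self, action):
        self.state += 1 if action == 1 else -1
        done = self.state == 0
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

# The standard interaction loop: env gives obs, agent picks action,
# env applies the action and gives the next obs and reward - and so on.
env, agent = ToyEnv(), RuleBasedAgent()
obs, done, steps = env.reset(), False, 0
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    steps += 1
print(steps)  # the rule drives the state from -2 to 0 in 2 steps
```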


The agent is actually an entity that I understand quite well in RL theory, but I can’t find documentation or examples about it in RLlib.
From my understanding, what is called an “agent” in theory roughly overlaps with the Trainer in RLlib; am I right?

Hi @Luca_Pretini,

In RLlib, the Policy class is what defines the actions taken given an observation.

The Trainer manages the entire process. It takes the environment and all the policies and orchestrates their interactions: how often and how many new samples are obtained, where they are stored, how they are used to update the policy, etc.
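
As a rough sketch of that division of labor (toy classes to show the roles, not RLlib's actual Trainer/Policy APIs; the "update" rule is invented for illustration):

```python
class ToyEnv:
    """Stand-in environment: a 1-D state pushed up or down by the action."""
    def reset(self):
        self.state = 0.0
        return self.state
    def step(self, action):
        self.state += 0.5 if action == 1 else -0.5
        done = abs(self.state) >= 2
        return self.state, 0.0, done, {}

class Policy:
    """Defines which action to take given an observation (the Policy's
    role); here a trivial threshold rule with a toy 'update'."""
    def __init__(self):
        self.threshold = 0.0
    def compute_action(self, obs):
        return 1 if obs > self.threshold else 0
    def update(self, samples):
        # Toy 'learning': nudge the threshold toward the mean observation.
        mean_obs = sum(o for o, _ in samples) / len(samples)
        self.threshold = 0.9 * self.threshold + 0.1 * mean_obs

class Trainer:
    """Orchestrates the whole process (the Trainer's role): collects
    samples from the env and decides when the policy gets updated."""
    def __init__(self, env, policy, batch_size=4):
        self.env, self.policy, self.batch_size = env, policy, batch_size
    def train(self):
        samples, obs = [], self.env.reset()
        for _ in range(self.batch_size):
            action = self.policy.compute_action(obs)
            obs, reward, done, _ = self.env.step(action)
            samples.append((obs, reward))
            if done:
                obs = self.env.reset()
        self.policy.update(samples)
        return len(samples)

policy = Policy()
trainer = Trainer(ToyEnv(), policy)
n = trainer.train()
```

The point is the separation: the Policy owns obs-to-action mapping, while the Trainer owns sampling, storage, and the update schedule.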