Meaning of inputs in VisionNet

Hi all, I have a simple question.

I’m trying to understand how the VisionNetwork model works. It is used by default when training with Atari gym envs, like Pong-v0. I have seen that the observations produced by this env have a 210x160x3 shape and that they are downscaled or upscaled by ray.rllib.models.preprocessors.GenericPixelPreprocessor to a (dim, dim, 3) shape. However, when inspecting the base Keras model that gets created, I observe that the inputs layer has an (84, 84, 4) shape. What does this “extra layer” mean? And how are the layer sizes defined? I saw that in visionnet.py the inputs layer (“observations”) is created using an obs_space shape, but I’m not able to find what this refers to. I would also appreciate any reference explaining the whole model: how it actually receives images and which outputs it produces. Thank you so much in advance!

Great question @javigm98! When using Atari envs, RLlib automatically applies the “deepmind” preprocessor stack, which consists of a series of baked-in wrappers: [NoopReset, EpisodicLife, FireReset, WarpFrame (does the grayscaling and resizing), FrameStack (stacks the 4 last observations)]. These make sure the actual output from the env has the shape [84, 84, 4]. Note that the color axis has been replaced by the framestack axis: instead of RGB channels, the image is a stack of the last 4 grayscaled 84x84 frames. The “input” layer you see is just the tensor your env now produces. In deep learning we sometimes say “input layer”, but it’s not an actual layer, it’s simply the input data; nothing has been processed by a learnable NN yet, only some static preprocessing logic applied to the raw env output (210x160x3).
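If you want to see this stack in action yourself, here is a minimal sketch (assuming the wrappers live under ray.rllib.env.wrappers.atari_wrappers; in older Ray versions the module path is ray.rllib.env.atari_wrappers):

```python
import gym
from ray.rllib.env.wrappers.atari_wrappers import wrap_deepmind

# Wrap the raw Atari env with the same "deepmind" stack RLlib uses:
# grayscale + resize to 84x84 (WarpFrame) and stack the last 4 frames.
env = wrap_deepmind(gym.make("Pong-v0"), dim=84, framestack=True)
obs = env.reset()
print(obs.shape)  # -> (84, 84, 4)
```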

This input data (84x84x4) then gets fed through the first VisionNet layer (a Conv2D layer) and, successively, the following two Conv2D layers. You can change this stack of 3 Conv2D layers by providing the “conv_filters” option in your “model” sub-config and setting up different numbers of filters, kernel sizes, strides, and padding options therein. However, be aware that changing these values can often break the convolutional stack. See rllib/models/utils.py::get_filter_config for our defaults, given different preprocessed image sizes.
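For example, a “model” sub-config along these lines (shown with what I believe are the default filters for 84x84 inputs; each entry is [num_filters, kernel, stride]):

```python
config = {
    "env": "Pong-v0",
    "model": {
        # One [num_filters, [kernel_h, kernel_w], stride] entry per Conv2D
        # layer. These should match the 84x84 defaults from
        # get_filter_config; the last layer reduces the spatial dims to 1x1.
        "conv_filters": [
            [16, [8, 8], 4],
            [32, [4, 4], 2],
            [256, [11, 11], 1],
        ],
        "conv_activation": "relu",
    },
}
```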

Hope this helps a little :slight_smile:


Hi @sven1977 and thanks for your answer, it was really helpful! So in this case, how can I access this preprocessor? I mean, where is it implemented, and how can I preprocess images “by hand”? I’m trying to collect a dataset of input data to feed a Keras model like the one RLlib creates for Atari envs, so since I can get the gym env observations, I would like to preprocess them the same way RLlib does and thereby transform the (210, 160, 3) gym observations into (1, 84, 84, 4) shape. I was trying to find where RLlib does that, but I wasn’t able to. So, in fact, all I want is a way or a method to transform (210, 160, 3) gym observations into (1, 84, 84, 4) data (as it is done in RLlib).
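This is roughly what I’m attempting by hand so far (my own sketch, not necessarily identical to what RLlib does internally), but I’d rather reuse RLlib’s own preprocessing if possible:

```python
# Grayscale, resize to 84x84, and keep a rolling stack of the last 4 frames.
from collections import deque

import cv2
import gym
import numpy as np

env = gym.make("Pong-v0")
frames = deque(maxlen=4)

def preprocess(obs):
    gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)                    # (210, 160)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small[:, :, None]                                        # (84, 84, 1)

obs = env.reset()
for _ in range(4):                            # fill the stack with the first frame
    frames.append(preprocess(obs))
stacked = np.concatenate(frames, axis=2)      # (84, 84, 4)
batch = stacked[None, ...]                    # (1, 84, 84, 4) to feed the Keras model
```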

Thank you so much in advance!