Correct specification for conv_filters?

I want to train a policy using a vision network with a custom environment. My input (image) shape is (41, 41, 1). The last dimension is kept so I can add more image channels later; for now I only use a grayscale image.

When trying to configure conv_filters in the model section of the trainer config to fit my custom input, I ran into some understanding problems that are probably misconceptions on my part. Maybe someone could help clarify this.

The docs provide an example for a 42 by 42 image:

"model": {"dim": 42, "conv_filters": [[16, [4, 4], 2], [32, [4, 4], 2], [512, [11, 11], 1]]}

A comment in catalog.py explains the parameters:

# Filter config: List of [out_channels, kernel, stride] for each filter.
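
Just to make sure I parse that format correctly, here is my own sketch of how I read it (not the actual vision net code): each entry would become one Conv2D layer, with the three values used as filters, kernel_size and strides.

import tensorflow as tf

conv_filters = [[16, [4, 4], 2], [32, [4, 4], 2], [512, [11, 11], 1]]

# My reading: one Conv2D per entry, out_channels -> filters, kernel -> kernel_size, stride -> strides
layers = [
    tf.keras.layers.Conv2D(filters=out_channels, kernel_size=kernel, strides=stride)
    for out_channels, kernel, stride in conv_filters
]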

In my understanding, the first argument for each filter (called "out_channels") is the number of filters, which defines the depth of the activation it outputs. But this would mean that the last filter in the example is incorrect.

Input shape: (42, 42, 1)
Is it correct that this last dimension doesn't really matter for the output shape, since the kernel matches its depth anyway and the output depth is set by the number of filters?
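
At least in plain Keras this seems to be the case (my own quick check, nothing RLlib-specific): the input depth only sets the kernel depth, while the output depth always equals out_channels.

import tensorflow as tf

x_gray = tf.random.normal((4, 42, 42, 1))  # grayscale input
x_rgb = tf.random.normal((4, 42, 42, 3))   # hypothetical 3-channel input

# The kernel depth adapts to the input depth; the output depth is always the
# number of filters (16 here).
print(tf.keras.layers.Conv2D(16, 4, strides=(2, 2))(x_gray).shape)  # (4, 20, 20, 16)
print(tf.keras.layers.Conv2D(16, 4, strides=(2, 2))(x_rgb).shape)   # (4, 20, 20, 16)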

Filter 1: kernel size = 4, stride = 2, out_channels/num filters = 16

Shape of activation 1: (20, 20, 16)

Filter 2: kernel size = 4, stride = 2, out_channels/num filters = 32

Shape of activation 2: (9, 9, 32)

Filter 3: kernel size = 11, stride = 1, out_channels/num filters = 512

Wouldn't this filter be invalid, since the first two spatial dimensions would be negative?
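
These numbers come from the usual output-size formula for padding="valid", floor((n - k) / s) + 1, applied per spatial dimension (my own hand calculation):

def valid_out(n, k, s):
    # per-dimension output size of a Conv2D with padding="valid"
    return (n - k) // s + 1

print(valid_out(42, 4, 2))   # 20
print(valid_out(20, 4, 2))   # 9
print(valid_out(9, 11, 1))   # -1 -> impossible layer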

To verify this, I adapted the following code from the Keras documentation:

import tensorflow as tf

# shape = (batch_size, height, width, channels)
input_shape = (4, 42, 42, 3)
a = tf.random.normal(input_shape)

b = tf.keras.layers.Conv2D(16, 4, strides=(2, 2), activation='relu', input_shape=input_shape[1:])(a)  # (4, 20, 20, 16)
c = tf.keras.layers.Conv2D(32, 4, strides=(2, 2), activation='relu', input_shape=input_shape[1:])(b)  # (4, 9, 9, 32)
d = tf.keras.layers.Conv2D(512, 11, strides=(1, 1), activation='relu', input_shape=input_shape[1:])(c)  # raises ValueError

This code throws the following error: “ValueError: One of the dimensions in the output is <= 0 due to downsampling in conv2d_2. Consider increasing the input size. Received input shape [4, 9, 9, 32] which would produce output shape with a zero or negative value in a dimension.”

Question: Where am I doing something wrong, or is the example incorrect? And how is this configuration able to meet the requirement of an output with shape (B, 1, 1, X), as described in the docs?

From the docs:
“Thereby, always make sure that the last Conv2D output has an output shape of [B, 1, 1, X] ( [B, X, 1, 1] for PyTorch), where B=batch and X=last Conv2D layer’s number of filters, so that RLlib can flatten it.”
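
My understanding of that sentence is that the (B, 1, 1, X) shape exists so a final flatten yields a plain (B, X) feature vector (just my reading, illustrated with a tiny Keras snippet):

import tensorflow as tf

features = tf.random.normal((4, 1, 1, 256))      # [B, 1, 1, X]
flat = tf.keras.layers.Flatten()(features)
print(flat.shape)                                # (4, 256) -> [B, X]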

Thanks for any help and hints!

OK, while digging through the code of the vision net I found my mistake: in my simple Keras re-implementation of the convolution part I missed the padding settings, which made my calculations wrong.

The first layers are added with padding="same", while the last one is added with padding="valid".
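
With padding taken into account, the per-dimension output sizes are ceil(n / s) for "same" and floor((n - k) / s) + 1 for "valid" (standard Keras behaviour, written down here as a sanity check):

import math

def same_out(n, s):
    # per-dimension output size with padding="same"
    return math.ceil(n / s)

def valid_out(n, k, s):
    # per-dimension output size with padding="valid"
    return (n - k) // s + 1

print(same_out(42, 2))        # 21
print(same_out(21, 2))        # 11
print(valid_out(11, 11, 1))   # 1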

So the working code, which indeed produces output shape (B, 1, 1, X), is this one:

import tensorflow as tf

input_shape = (4, 42, 42, 1)

a = tf.random.normal(input_shape)
b = tf.keras.layers.Conv2D(16, 4, strides=(2, 2), padding="same", activation='relu', input_shape=input_shape[1:])(a)
c = tf.keras.layers.Conv2D(32, 4, strides=(2, 2), padding="same", activation='relu', input_shape=input_shape[1:])(b)
d = tf.keras.layers.Conv2D(256, 11, strides=(1, 1), padding="valid", activation='relu', input_shape=input_shape[1:])(c)

print(a.shape) # (4, 42, 42, 1)
print(b.shape) # (4, 21, 21, 16)
print(c.shape) # (4, 11, 11, 32)
print(d.shape) # (4, 1, 1, 256)
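
For my actual (41, 41, 1) observations, the same same/same/valid padding scheme also seems to end in (B, 1, 1, X), assuming the vision net applies the padding exactly as above (my own check, not yet verified inside RLlib):

import tensorflow as tf

input_shape = (4, 41, 41, 1)  # my actual observation: 41x41 grayscale

a = tf.random.normal(input_shape)
b = tf.keras.layers.Conv2D(16, 4, strides=(2, 2), padding="same", activation='relu')(a)
c = tf.keras.layers.Conv2D(32, 4, strides=(2, 2), padding="same", activation='relu')(b)
d = tf.keras.layers.Conv2D(256, 11, strides=(1, 1), padding="valid", activation='relu')(c)

print(b.shape)  # (4, 21, 21, 16)
print(c.shape)  # (4, 11, 11, 32)
print(d.shape)  # (4, 1, 1, 256)

So presumably the matching conv_filters entry would be [[16, [4, 4], 2], [32, [4, 4], 2], [256, [11, 11], 1]], but I still have to confirm that end-to-end in RLlib.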