I want to train a policy using a vision network with a custom environment. My input (image) shape is (41, 41, 1). The last dimension is kept so that I can add more image channels later; for now I only use a grayscale image.
When trying to configure conv_filters in the policy section of the trainer config to fit my custom input, I ran into some understanding problems that are probably misconceptions on my part. Maybe someone could help clarify this.
The docs provide an example for a 42 by 42 image:
"model": {"dim": 42, "conv_filters": [[16, [4, 4], 2], [32, [4, 4], 2], [512, [11, 11], 1]]}
The catalog.py explains the parameters:
# Filter config: List of [out_channels, kernel, stride] for each filter.
In my understanding, the first argument of each filter (called "out_channels") is the number of filters, which defines the depth of the activation it outputs. But this would mean that the last filter in the example is incorrect.
Input shape: (42,42,1)
Is it correct that the last dimension of the input doesn't really matter for the spatial dimensions of the output, since each kernel "negates" it anyway by spanning the full input depth?
Filter 1: kernel size = 4, stride = 2, out_channels/num filters = 16
Shape of activation 1: (20, 20, 16)
Filter 2: kernel size = 4, stride = 2, out_channels/num filters = 32
Shape of activation 2: (9, 9, 32)
Filter 3: kernel size = 11, stride = 1, out_channels/num filters = 512
Wouldn't this filter be invalid, since the first two output dimensions would be negative?
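My hand calculation above assumes no padding (VALID), where the output size is floor((in - kernel) / stride) + 1. A quick sketch of that formula applied to the three filters (the helper name conv_out is just mine):

```python
# Output size of one conv dimension with VALID (no) padding:
# out = floor((in - kernel) / stride) + 1
def conv_out(size, kernel, stride):
    return (size - kernel) // stride + 1

size = 42
for kernel, stride in [(4, 2), (4, 2), (11, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)
# prints 20, then 9, then -1
```

The last step comes out negative, which is exactly the problem I am asking about.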
To verify, I adapted the following code from the Keras documentation:
import tensorflow as tf

# shape = (batch_size, height, width, channels)
input_shape = (4, 42, 42, 3)
a = tf.random.normal(input_shape)
b = tf.keras.layers.Conv2D(16, 4, strides=(2, 2), activation='relu')(a)
c = tf.keras.layers.Conv2D(32, 4, strides=(2, 2), activation='relu')(b)
d = tf.keras.layers.Conv2D(512, 11, strides=(1, 1), activation='relu')(c)
This code throws the following error: "ValueError: One of the dimensions in the output is <= 0 due to downsampling in conv2d_2. Consider increasing the input size. Received input shape [4, 9, 9, 32] which would produce output shape with a zero or negative value in a dimension."
Question: Where am I going wrong, or is the example incorrect? And how is this configuration able to meet the requirement of an output shape of (B, 1, 1, X), as described in the docs?
from the docs:
"Thereby, always make sure that the last Conv2D output has an output shape of [B, 1, 1, X] ([B, X, 1, 1] for PyTorch), where B=batch and X=last Conv2D layer's number of filters, so that RLlib can flatten it."
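As an experiment (I don't know whether this matches what RLlib actually does internally), I noticed that the docs example does reach the required (B, 1, 1, X) shape if the first two layers use 'same' padding instead of the default 'valid':

```python
import tensorflow as tf

input_shape = (4, 42, 42, 3)
a = tf.random.normal(input_shape)
# With 'same' padding the output size is ceil(in / stride): 42 -> 21 -> 11
b = tf.keras.layers.Conv2D(16, 4, strides=(2, 2), padding='same', activation='relu')(a)
c = tf.keras.layers.Conv2D(32, 4, strides=(2, 2), padding='same', activation='relu')(b)
# Final layer with default 'valid' padding: an 11x11 kernel on an 11x11 input -> 1x1
d = tf.keras.layers.Conv2D(512, 11, strides=(1, 1), activation='relu')(c)
print(d.shape)  # (4, 1, 1, 512)
```

If that is how the intermediate layers are padded, the example config would be consistent with the quoted requirement, but I would appreciate confirmation.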
Thanks for any help and hints!