Passing Colour Image Into Default Mode

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.


I am looking to use screen capture as input to a PPO+Curiosity module (the ones built into Ray).

However, I came across this:

# VisionNetwork (tf and torch):
    # These are used if no custom model is specified and the input space is 2D.
    # Filter config: List of [out_channels, kernel, stride] for each filter.
    # Example:
    # Use None for making RLlib try to find a default filter setup given the
    # observation space.
    "conv_filters": None,
    # Activation function descriptor.
    # Supported values are: "tanh", "relu", "swish" (or "silu"),
    # "linear" (or None).
    "conv_activation": "relu",

    # Some default models support a final FC stack of n Dense layers with given
    # activation:
    # - Complex observation spaces: Image components are fed through
    #   VisionNets, flat Boxes are left as-is, Discrete are one-hot'd, then
    #   everything is concated and pushed through this final FC stack.
    # - VisionNets (CNNs), e.g. after the CNN stack, there may be
    #   additional Dense layers.
    # - FullyConnectedNetworks will have this additional FCStack as well
    # (that's why it's empty by default).
    "post_fcnet_hiddens": [],
    "post_fcnet_activation": "relu",
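
For context, these keys live under the "model" sub-dict of an RLlib trainer config. A sketch of how that would look (the env name and filter values here are just placeholders, not taken from the thread):

```python
# Hypothetical trainer config showing where the model keys from the
# excerpt above are placed. "my_env" and the filter values are placeholders.
config = {
    "env": "my_env",
    "model": {
        "conv_filters": [[16, [8, 8], 4], [32, [4, 4], 2], [256, [11, 11], 1]],
        "conv_activation": "relu",
        "post_fcnet_hiddens": [256],
        "post_fcnet_activation": "relu",
    },
}

print(config["model"]["conv_filters"][0])  # [16, [8, 8], 4]
```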

I have a few questions. Let’s say my image is 128x128x3 (because it is colour) and I want to preserve that colour:

  1. “conv_filters”: am I correct in understanding that I would want something like this:
    "conv_filters": [[3, [5, 5], 1], [3, [3, 3], 2]] → This would take my 128x128x3 and output, using the equation from here:
    1. 128 x 128 x 3
    2. 124 x 124 x 3
    3. 61.5 x 61.5 x 3

Is that correct?
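
For reference, a small sketch of the output-size formula being assumed here ("valid" padding: out = (in - kernel) / stride + 1, floored when it doesn't divide evenly):

```python
# Helper (illustrative, not RLlib API): per-layer spatial output sizes of
# a conv stack under "valid" padding: out = floor((in - kernel) / stride) + 1.
def conv_output_sizes(size, filters):
    sizes = [size]
    for _num_filters, kernel, stride in filters:
        k = kernel[0] if isinstance(kernel, (list, tuple)) else kernel
        s = stride[0] if isinstance(stride, (list, tuple)) else stride
        size = (size - k) // s + 1  # integer division floors the 61.5 above
        sizes.append(size)
    return sizes

filters = [[3, [5, 5], 1], [3, [3, 3], 2]]
print(conv_output_sizes(128, filters))  # [128, 124, 61]
```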

  2. Regarding “post_fcnet_hiddens”: this would come after the above convolutional layers? So let’s say I want a final fcnet of [128, 128]. Would it automatically resize the output (61.5x61.5x3) to fit into the 128, or do I need to do something else?

  3. What if I want to add max pooling layers for the convolutions: is there a way I can do that, or do I have to do it myself externally? In that case I may as well pre-process the image input myself. If I do preprocess the image myself, what kind of input exactly does Ray want? Is it a tensor, or can it be a numpy array (much, much preferred)?

  4. What kind of convolution is applied? I understand we specify the stride, kernel, etc., but is it avg? Max?

I realise it’s a lot of questions, but any and all feedback is appreciated!

Hey @Denys_Ashikhin , thanks for posting this question!

  1. Yes, this is the correct format for specifying your CNN layers. Each item in the list represents one layer, and each item is itself a list of [num-filters, kernel-size, stride].

  2. Yes, correct. You don’t really need to do anything for the post_fcnet_hiddens, except that you have to make sure your CNN outputs are flat at the end. If they aren’t, you will get an error like the following and then only have to adjust your CNN settings:

Given `conv_filters` (...) do not result in a [B, 1, 1, `num_outputs`] shape (but in ...)! Please
adjust your Conv2D stack such that the dims 1 and 2 are both 1.
  3. You can write your own model, which includes pooling layers. Take a look at the custom model example here: ray/rllib/examples/ The observations coming from your env may be simple numpy arrays (e.g. 3D arrays for images). RLlib will automatically convert the data coming from the env into the respective tensors.

  4. We simply use tf’s and torch’s default CNN layers, e.g. tf.keras.layers.Conv2D.
    I’m not sure about max and avg; maybe you are confusing it with pooling layers (which we don’t use by default).
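
Regarding preprocessing the image externally: a minimal NumPy sketch of a 2x2 max pool over a channels-last image. The result stays a plain numpy array, which RLlib accepts as an observation (this is an illustration, not RLlib code):

```python
import numpy as np

# 2x2 max pooling over a channels-last (H, W, C) image in plain NumPy.
def max_pool_2x2(img):
    h, w, c = img.shape
    # Trim odd edges so the image tiles evenly into 2x2 blocks.
    img = img[: h // 2 * 2, : w // 2 * 2, :]
    # Reshape into (h/2, 2, w/2, 2, c) blocks and take the max per block.
    return img.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

obs = np.random.rand(128, 128, 3).astype(np.float32)
print(max_pool_2x2(obs).shape)  # (64, 64, 3)
```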

That makes sense so far! I have a couple follow up questions,

  1. I found example convolutions here, for example:
filters_480x640 = [
        [16, [24, 32], [14, 18]],
        [32, [6, 6], 4],
        [256, [9, 9], 1],
]

If we take just the height (480px) → it goes something like:

  • 33.57142857142857
  • 7.892857142857142
  • -0.10714285714285765
    Does each step get reduced with a floor operation, to make the numbers work out nicely? And it seems like the trend is to get the final value to 0, not 1. I feel like I might be missing something here.
  2. Does the order in which our image is fed in matter? Like width first or height first (i.e. 480x640 vs 640x480), assuming we keep our convolution settings consistent so that the final result is 0/1 (based on the answer above)?

  3. Perhaps outside the scope of this question, but do we want to normalise the input values of the image? (I know that each value of RGB is 0-255, so I can scale it easily in advance.)

Thanks again!

Hey @Denys_Ashikhin , thanks for following up. I’m not sure I understand the first question:
There is no floor operation, as far as I know, inside a plain keras CNN layer. Consider each configured filter, e.g. [24, 32], as a sliding window (smaller than the actual image) that is slid over the image from top to bottom AND left to right (using the strides, e.g. [14, 18]); at each stride, the values of this sliding window are multiplied with the actual image pixel values. This reduces the input image into a smaller one. You do this e.g. 16 times (16 different such filters) to obtain your input to the next filter (e.g. [32, [6, 6], 4]). Sorry, but you should read about how convolutional layers work in more detail; then I think it will make more sense to you what the intermediary values are that RLlib/tf/keras produces between the layers and at the end.

  2. Shouldn’t matter, as long as the filter settings you use match. E.g. if your image is h x w, then the filter settings would need to be interpreted as:
filters_480x640 = [
        [16, [24, 32], [14, 18]],  # kernel: h=24, w=32; stride: h=14, w=18
        [32, [6, 6], 4],           # kernel: h=6, w=6; stride: h=4, w=4
        [256, [9, 9], 1],          # kernel: h=9, w=9; stride: h=1, w=1
]
  3. Yes, normalization is always a good idea! You could, for example, simply divide your raw pixel values by 255 (normalize between 0.0 and 1.0), OR divide by 128 and then subtract 1.0 (normalize between -1.0 and 1.0). Both are accepted ways of normalizing.
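
A quick sketch of both normalization options in NumPy (note that for uint8 inputs the second variant tops out just below 1.0, at 255/128 - 1):

```python
import numpy as np

# A dummy 128x128 RGB observation with uint8 pixel values in [0, 255].
img = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)

# Option 1: scale into [0.0, 1.0].
scaled_01 = img.astype(np.float32) / 255.0

# Option 2: scale into roughly [-1.0, 1.0].
scaled_pm1 = img.astype(np.float32) / 128.0 - 1.0

print(scaled_01.min() >= 0.0, scaled_01.max() <= 1.0)
print(scaled_pm1.min() >= -1.0, scaled_pm1.max() < 1.0)
```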

Hi @sven1977,

I think I wasn’t clear enough for 1, however, for 2 and 3 the answer is perfect!
For 1, what I mean is that I don’t understand how we arrive at certain values. Let’s just take one dimension for now.

Take 480, a sliding window of 24, and a stride of 14. No matter how I mix these numbers (in a realistic manner), I cannot get to 6. I think my misunderstanding stems from the final result being 6, OR from how we use 24 and 14 to cover 480 perfectly without any remainders.

Intuitively, for me, we would use a window and a stride that covers the length perfectly, without any remainders.

For example: a length of 5. We can use a sliding window of 1 and a stride of 5, or a window of 5 and a stride of 1. But it doesn’t make sense to me how we would use, say, a window of 2 and a stride of 3, since those combinations don’t fit nicely into the overall length.

That’s why I threw out something like floor, to handle those remainders. I will definitely look into this on my own some more, but if you could provide something simple (i.e. “it doesn’t have to fit nicely; read into it”, or “it does have to fit nicely, using the following formula…”), that’d be really appreciated!
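
To make my example concrete, here is the kind of thing I mean (window of 2, stride of 3, length of 5), assuming a floor is applied as in "valid" padding:

```python
# With "valid"-style sliding, the window only takes positions where it
# still fits entirely inside the input; any leftover tail is dropped.
# That drop is exactly the floor in: out = floor((length - k) / s) + 1.
length, k, s = 5, 2, 3

# Starting positions where a window of size k still fits.
positions = list(range(0, length - k + 1, s))
print(positions)              # [0, 3] -> windows cover [0:2] and [3:5]
print((length - k) // s + 1)  # 2 outputs (index 2 is skipped, not floored away)
```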