Note: I've read How do subsequent convolution layers work? a few times, but it's still difficult to understand because of the parameters $k_1$ and $k_2$ and the many proposals (1, 2.1, 2.2) in the question. This seems to be hard for many other people too, so I don't think I'm the only one (see comments like "I have just struggled with this same question for a few hours"). So here it is formulated with a specific example and no parameters, to make the idea easier to grasp.
Let's say we have a CNN with:
- input: 28x28x1 grayscale images (28x28 pixels, 1 channel)
- 1st convolutional layer with kernel size 3x3 and 32 output feature maps (filters)
- 2nd convolutional layer with kernel size 3x3 and 64 output feature maps (filters)
Keras implementation:
```python
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
```
Question: how does the 2nd layer work?
More precisely:
for the 1st layer:
- input size: (1, 28, 28, 1)
- weights size: (3, 3, 1, 32) (good to know: the number of weights doesn't depend on the size of the input image)
- output size: (1, 26, 26, 32)
for the 2nd layer:
- input size: (1, 26, 26, 32)
- weights size: (3, 3, 32, 64)
- output size: (1, 24, 24, 64)
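To double-check these numbers, here is a minimal sketch (assuming the standalone keras package; with TensorFlow 2 the same code works via tensorflow.keras) that rebuilds the model above and prints the kernel shapes Keras reports:

```python
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))

# Output shapes: (None, 26, 26, 32) and (None, 24, 24, 64);
# parameter counts: 3*3*1*32 + 32 = 320 and 3*3*32*64 + 64 = 18496
model.summary()

for layer in model.layers:
    # kernel shape is (kernel_h, kernel_w, in_channels, out_channels)
    print(layer.name, layer.get_weights()[0].shape)
# prints (3, 3, 1, 32) for the 1st layer and (3, 3, 32, 64) for the 2nd
```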
How is the latter possible? It seemed to me that, in the 2nd layer, each 26x26 input channel would be convolved with the 3x3 kernel of each of the 64 feature maps, and that this would be done for all 32 input channels (input size for the 2nd layer: (1, 26, 26, 32)).
Thus I had the feeling the output of the 2nd layer should be (1, 24, 24, 32*64).
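To make my expectation concrete, here is a rough NumPy sketch of what I imagined the 2nd layer doing (conv2d_valid is a hypothetical helper written only for this illustration; 'valid' padding, no bias, no activation): convolving each of the 32 input channels separately with each of the 64 kernels, which would give 32*64 feature maps of size 24x24:

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Naive single-channel 'valid' convolution (strictly, cross-correlation,
    # as in most deep learning libraries); purely illustrative.
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

layer2_input = np.random.rand(26, 26, 32)  # output of the 1st layer, for one sample
kernels = np.random.rand(3, 3, 64)         # what I imagined: one 3x3 kernel per output feature map

# My expectation: convolve every input channel with every kernel separately
maps = [conv2d_valid(layer2_input[:, :, c], kernels[:, :, k])
        for k in range(64) for c in range(32)]
expected_output = np.stack(maps, axis=-1)
print(expected_output.shape)  # (24, 24, 2048), i.e. 32*64 feature maps
```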
How does it work here?