I recently came across a paper on semantic segmentation using a deconvolution network: Learning Deconvolution Network for Semantic Segmentation.
The basic structure of the network is like this:
The goal is to produce a per-pixel probability map at the end. I'm having trouble figuring out how to implement the deconvolution layer. The paper says:
The output of an unpooling layer is an enlarged, yet sparse activation map. The deconvolution layers densify the sparse activations obtained by unpooling through convolution-like operations with multiple learned filters. However, contrary to convolutional layers, which connect multiple input activations within a filter window to a single activation, deconvolutional layers associate a single input activation with multiple outputs.
The output of the deconvolutional layer is an enlarged and dense activation map. We crop the boundary of the enlarged activation map to keep the size of the output map identical to the one from the preceding unpooling layer.
The learned filters in deconvolutional layers correspond to bases to reconstruct shape of an input object. Therefore, similar to the convolution network, a hierarchical structure of deconvolutional layers are used to capture different level of shape details. The filters in lower layers tend to capture overall shape of an object while the class-specific fine details are encoded in the filters in higher layers. In this way, the network directly takes class-specific shape information into account for semantic segmentation.
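For context, here is my rough understanding of the "unpooling" step the quote refers to, as a minimal sketch (my own code, not the paper's; it assumes the pooling stage recorded the argmax locations, often called "switches"):

```python
import numpy as np

def unpool2d(pooled, switches, out_shape):
    """Max-unpooling sketch: place each pooled value back at the
    location (the "switch") where its max originally came from.
    Every other position stays zero, which gives the enlarged but
    sparse activation map the paper describes."""
    out = np.zeros(out_shape)
    # `switches` holds, for each pooled value, its flat index in `out`
    out.flat[switches.ravel()] = pooled.ravel()
    return out
```

So a 1x1 pooled value of 5 whose max came from position (1, 1) of a 2x2 window would be written back to that position, with zeros elsewhere.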
Can anyone explain how the deconvolution works? I'm guessing it's not a simple interpolation.
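To make the question concrete, here is my rough mental model of a single deconvolution (transposed-convolution) step, matching the "single input activation with multiple outputs" description above. This is just a naive sketch I wrote to check my understanding, not the paper's implementation, and the stride/padding choices are my own assumptions:

```python
import numpy as np

def deconv2d(x, w, stride=2):
    """Naive 2-D transposed convolution: each input activation is
    multiplied by the entire filter, and the scaled filters are
    accumulated (with overlap) into a larger output map."""
    h_in, w_in = x.shape
    k = w.shape[0]  # assume a square k x k filter
    out = np.zeros((stride * (h_in - 1) + k, stride * (w_in - 1) + k))
    for i in range(h_in):
        for j in range(w_in):
            # one input activation fans out to a k x k patch of outputs
            out[i*stride:i*stride+k, j*stride:j*stride+k] += x[i, j] * w
    return out
```

With a 2x2 input, a 3x3 filter, and stride 2, this produces a 5x5 output, and overlapping patches sum where they meet. Is this fan-out-and-accumulate picture what the deconvolution layers are actually doing, with the filters learned by backprop?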