In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module is called NetWarp, and we demonstrate its use for a range of network architectures. The main design principle is to use the optical flow between adjacent frames to warp internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only a small extra computational cost while improving performance when video streams are available. We achieve new state-of-the-art results on the standard CamVid and Cityscapes benchmark datasets and show reliable improvements over different baseline networks.

Illustration of computations in the NetWarp module: First, optical flow $F_t$ is computed between two video frames at time steps $t$ and $t-1$. Then the NetWarp module transforms the flow with a few convolutional layers to obtain $\Lambda(F_t)$, warps the activations $\mathbf{z}^k_{(t-1)}$ of the previous frame, and combines the warped representations with those of the present frame $\mathbf{z}^k_t$. The resulting representation $\widetilde{\mathbf{z}}^k_t$ is then passed on to the remaining CNN layers for semantic segmentation.
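
The warp-and-combine step in the caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `warp_features` and `combine` are hypothetical, the flow is assumed to be already transformed (i.e. it stands in for $\Lambda(F_t)$) and resized to the feature-map resolution, the warp is assumed to use bilinear interpolation, and the combination is assumed to be a per-channel weighted sum with weights `w_curr` and `w_prev` that would be learned in a real network.

```python
import numpy as np

def warp_features(z_prev, flow):
    """Bilinearly warp previous-frame activations z_prev (H, W, C).

    Assumption: flow has shape (H, W, 2) and flow[y, x] = (dx, dy) points
    from location (x, y) in frame t to its match in frame t-1.
    """
    H, W, _ = z_prev.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sampling positions in the previous frame, clamped to the feature map.
    src_x = np.clip(xs + flow[..., 0], 0, W - 1)
    src_y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    # Fractional offsets, broadcast over the channel axis.
    wx = (src_x - x0)[..., None]
    wy = (src_y - y0)[..., None]
    top = (1 - wx) * z_prev[y0, x0] + wx * z_prev[y0, x1]
    bottom = (1 - wx) * z_prev[y1, x0] + wx * z_prev[y1, x1]
    return (1 - wy) * top + wy * bottom

def combine(z_curr, z_warped, w_curr, w_prev):
    """Assumed combination: per-channel weighted sum of the two maps."""
    return w_curr * z_curr + w_prev * z_warped

# Toy usage: a 4x5 feature map with 3 channels and a flow of one pixel in x.
rng = np.random.default_rng(0)
z_prev = rng.standard_normal((4, 5, 3))
z_curr = rng.standard_normal((4, 5, 3))
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0
z_tilde = combine(z_curr, warp_features(z_prev, flow),
                  w_curr=np.full(3, 0.5), w_prev=np.full(3, 0.5))
```

Because bilinear sampling and a weighted sum are both differentiable with respect to the activations and the flow, a step of this shape is compatible with the end-to-end training mentioned in the abstract.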



  title = {Semantic Video CNNs through Representation Warping},
  author = {Gadde, Raghudeep and Jampani, Varun and Gehler, Peter V.},
  booktitle = {IEEE International Conference on Computer Vision (ICCV)},
  year = {2017},