Open access
Author
Date
2023
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Video scene understanding encompasses several fundamental and challenging computer vision tasks that complement each other. Some of them inherently reason over a set of consecutive images, while others can be tackled on each frame separately. In this thesis, we focus on a subset of these tasks, starting from a global scene understanding perspective with semantic segmentation and ending with a more local one, Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Within that scope, we investigate different ways to leverage and combine temporal cues to improve scene understanding algorithms when processing videos.
More specifically, in the first part we analyze how the spatio-temporal correlations found in videos can be used to increase either the frame rate or the accuracy of single-frame semantic segmentation methods. First, we use optical flow to propagate semantic information across frames and build a pipeline for real-time video semantic segmentation that balances the computation load between GPU and CPU. Instead of designing a heavy neural network that infers everything on the GPU, we focus the GPU on either predicting segmentation masks from scratch or refining propagated labels, while a fast optical flow method running on the CPU provides the motion vectors to warp semantic labels and features from one frame to the next. The refinement is done by a lightweight module that accounts for potential optical flow mistakes. We propose several operating points offering different trade-offs between speed and accuracy, and observe that our approach can lead to massive speedups at the price of a small drop in segmentation accuracy.
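To make the flow-based propagation step concrete, the following is a minimal PyTorch sketch of backward-warping the previous frame's segmentation logits with an optical flow field and correcting them with a lightweight refinement head. It is an illustration under assumed tensor shapes, not the thesis implementation; the names warp_with_flow and FlowRefiner and the layer widths are hypothetical.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(prev_logits: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp segmentation logits from frame t-1 to frame t.

    prev_logits: (B, C, H, W) logits predicted on the previous frame.
    flow:        (B, 2, H, W) backward flow, i.e. for each pixel of frame t
                 the (dx, dy) offset to its source location in frame t-1.
    """
    B, _, H, W = prev_logits.shape
    # Base grid of pixel coordinates (x, y) for every location of frame t.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(prev_logits.device)  # (2, H, W)
    src = grid.unsqueeze(0) + flow  # source coordinates in frame t-1
    # Normalize to [-1, 1] as required by grid_sample (x first, then y).
    src_x = 2.0 * src[:, 0] / (W - 1) - 1.0
    src_y = 2.0 * src[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((src_x, src_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(prev_logits, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


class FlowRefiner(torch.nn.Module):
    """Hypothetical lightweight head that fuses warped logits with cheap
    current-frame features to correct optical-flow errors."""

    def __init__(self, num_classes: int, feat_ch: int = 32):
        super().__init__()
        self.fuse = torch.nn.Sequential(
            torch.nn.Conv2d(num_classes + feat_ch, 64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, warped_logits: torch.Tensor, cur_feats: torch.Tensor) -> torch.Tensor:
        # Predict a residual correction on top of the propagated logits.
        return warped_logits + self.fuse(torch.cat([warped_logits, cur_feats], dim=1))
```

In a pipeline of the kind described above, the flow would be computed on the CPU, while only the small refinement head (or, on keyframes, the full segmentation network) runs on the GPU.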
Then, we propose to directly exploit temporal correlations and appearance cues without an additional optical flow module. To achieve this, we aggregate semantic information from previous frames in a memory module that is accessed through attention mechanisms. Our pipeline first retrieves the deep features from past frames stored in memory and matches them in a local neighborhood around each pixel. These spatio-temporal cues are then fused with the current frame encoding to improve the final segmentation prediction. Our approach introduces a set of simple yet generic modules that can convert virtually any existing single-frame method into a video pipeline. We demonstrate the improvements of our architecture in terms of segmentation accuracy on two popular single-frame semantic segmentation networks.
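The memory read described above can be illustrated with a small local-attention sketch: each pixel of the current frame attends to a spatial neighborhood in every stored past frame, and the aggregated values are later fused with the current encoding. This is a simplified sketch under assumed tensor shapes; local_memory_attention is an illustrative name, not the module from the thesis.

```python
import torch
import torch.nn.functional as F


def local_memory_attention(query: torch.Tensor,
                           mem_keys: torch.Tensor,
                           mem_vals: torch.Tensor,
                           window: int = 7) -> torch.Tensor:
    """Attend from each current-frame pixel to a (window x window) neighborhood
    of every memory frame, then average over the T memory frames.

    query:    (B, C, H, W)      current-frame features
    mem_keys: (B, T, C, H, W)   key features of T past frames
    mem_vals: (B, T, Cv, H, W)  value features (e.g. semantic features) of T past frames
    """
    B, T, C, H, W = mem_keys.shape
    Cv = mem_vals.shape[2]
    pad = window // 2
    out = torch.zeros(B, Cv, H, W, device=query.device)
    q = query.view(B, C, 1, H * W)
    for t in range(T):
        # Gather local neighborhoods of keys and values around every pixel.
        k = F.unfold(mem_keys[:, t], window, padding=pad).view(B, C, window * window, H * W)
        v = F.unfold(mem_vals[:, t], window, padding=pad).view(B, Cv, window * window, H * W)
        # Scaled dot-product similarity between each query pixel and its neighbors.
        attn = torch.softmax((q * k).sum(dim=1) / C ** 0.5, dim=1)  # (B, window*window, H*W)
        out += (v * attn.unsqueeze(1)).sum(dim=2).view(B, Cv, H, W)
    return out / T
```

The aggregated memory features would then be concatenated with the current-frame encoding before the segmentation head, which is what allows the same scheme to be bolted onto an existing single-frame network.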
In the second part, we shift our focus to the tasks of tracking and segmenting single objects in a video, aiming to bridge the gap between the two. We study how they are related and expose the benefits of working with segmentation masks in the context of VOT. To that end, we propose a segmentation-centric approach which, in contrast with most existing approaches, internally works with segmentation masks and predicts them without the need for an additional module. A dedicated instance localization branch, inspired by existing trackers, brings the robustness required by VOT challenges and conditions the segmentation decoder to predict the correct segmentation mask. We show that our unified architecture yields state-of-the-art results compared to other trackers, both in terms of robustness and accuracy, while generating accurate segmentation masks.
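As a rough picture of how a localization branch can condition a mask decoder, the skeleton below shows the data flow of such a segmentation-centric tracker. It is purely schematic: the class name SegCentricTracker, the single-convolution backbone, and the layer widths are placeholders and do not reflect the actual architecture of the thesis.

```python
import torch
import torch.nn as nn


class SegCentricTracker(nn.Module):
    """Schematic single-object tracker that works with masks end to end:
    a localization branch scores where the target is, and its output
    conditions the segmentation decoder."""

    def __init__(self, feat_ch: int = 256):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_ch, 3, stride=4, padding=1)  # stand-in encoder
        self.localizer = nn.Conv2d(feat_ch, 1, 3, padding=1)           # target score map
        self.decoder = nn.Sequential(                                  # mask decoder
            nn.Conv2d(feat_ch + 1, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, frame: torch.Tensor):
        feats = self.backbone(frame)
        score = self.localizer(feats)                       # robust instance localization
        cond = torch.cat([feats, torch.sigmoid(score)], dim=1)
        mask_logits = self.decoder(cond)                    # mask conditioned on localization
        return mask_logits, score
```

A real tracker would additionally encode the first-frame target annotation and fuse it with the test-frame features before localization; the point of the sketch is only that localization and mask prediction live in one network rather than in a tracker plus a separate segmentation module.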
Permanent link
https://doi.org/10.3929/ethz-b-000628365
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Van Gool, Luc
Examiner: Leibe, Bastian
Examiner: Felsberg, Michael
Examiner: Dragon, Ralf
Publisher
ETH Zurich
Subject
Computer Vision; Deep Learning; Semantic Segmentation; Visual Object Tracking; Video Object Segmentation
Organisational unit
03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)