Open access
Author
Date
2023
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Video scene understanding encompasses several fundamental and challenging computer vision tasks that complement each other. Some of them inherently reason over a set of consecutive images, while others can be tackled on each frame separately. In this thesis, we focus on a subset of these tasks, starting from a global scene understanding perspective with semantic segmentation and ending with a more local one, Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Within that scope, we investigate different ways to leverage and combine temporal cues to improve scene understanding algorithms when processing videos.
More specifically, in the first part we analyze how the spatio-temporal correlations found in videos can be used to increase either the frame rate or the accuracy of single-frame semantic segmentation methods. First, we use optical flow to propagate semantic information across frames and build a pipeline for real-time video semantic segmentation that balances the computation load between GPU and CPU. Instead of designing a heavy neural network that infers everything on the GPU, we focus the GPU on either predicting segmentation masks from scratch or refining propagated labels, while a fast optical flow method running on the CPU provides the motion vectors to warp semantic labels and features from one frame to the next. The refinement is done by a lightweight module that accounts for potential optical flow mistakes. We propose several operating points offering different trade-offs between speed and accuracy, and observe that our approach can lead to massive speedups at the price of a small drop in segmentation accuracy.
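To make the flow-based propagation step concrete, the following is a minimal PyTorch sketch of backward-warping the previous frame's segmentation logits with an optical flow field and correcting them with a lightweight refinement head. It is an illustration under assumed tensor shapes, not the thesis implementation; the names warp_with_flow and FlowRefiner and the layer widths are hypothetical.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(prev_logits: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp segmentation logits from frame t-1 to frame t.

    prev_logits: (B, C, H, W) logits predicted on the previous frame.
    flow:        (B, 2, H, W) backward flow, i.e. for each pixel of frame t
                 the (dx, dy) offset to its source location in frame t-1.
    """
    B, _, H, W = prev_logits.shape
    # Base grid of pixel coordinates (x, y) for every location of frame t.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(prev_logits.device)  # (2, H, W)
    src = grid.unsqueeze(0) + flow  # source coordinates in frame t-1
    # Normalize to [-1, 1] as required by grid_sample (x first, then y).
    src_x = 2.0 * src[:, 0] / (W - 1) - 1.0
    src_y = 2.0 * src[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((src_x, src_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(prev_logits, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


class FlowRefiner(torch.nn.Module):
    """Hypothetical lightweight head that fuses warped logits with cheap
    current-frame features to correct optical-flow errors."""

    def __init__(self, num_classes: int, feat_ch: int = 32):
        super().__init__()
        self.fuse = torch.nn.Sequential(
            torch.nn.Conv2d(num_classes + feat_ch, 64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, warped_logits: torch.Tensor, cur_feats: torch.Tensor) -> torch.Tensor:
        # Predict a residual correction on top of the propagated logits.
        return warped_logits + self.fuse(torch.cat([warped_logits, cur_feats], dim=1))
```

In a pipeline of the kind described above, the flow would be computed on the CPU, while only the small refinement head (or, on keyframes, the full segmentation network) runs on the GPU.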
Then, we propose to directly exploit temporal correlations and appearance cues without an additional optical flow module. To achieve this, we aggregate semantic information from previous frames in a memory module that is accessed through attention mechanisms. Our pipeline first retrieves the deep features from past frames stored in memory and matches them in a local neighborhood around each pixel. These spatio-temporal cues are then fused with the current frame encoding to improve the final segmentation prediction. Our approach introduces a set of simple yet generic modules that can convert virtually any existing single-frame method into a video pipeline. We demonstrate the improvements of our architecture in terms of segmentation accuracy on two popular single-frame semantic segmentation networks.
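The memory read described above can be illustrated with a small local-attention sketch: each pixel of the current frame attends to a spatial neighborhood in every stored past frame, and the aggregated values are later fused with the current encoding. This is a simplified sketch under assumed tensor shapes; local_memory_attention is an illustrative name, not the module from the thesis.

```python
import torch
import torch.nn.functional as F


def local_memory_attention(query: torch.Tensor,
                           mem_keys: torch.Tensor,
                           mem_vals: torch.Tensor,
                           window: int = 7) -> torch.Tensor:
    """Attend from each current-frame pixel to a (window x window) neighborhood
    of every memory frame, then average over the T memory frames.

    query:    (B, C, H, W)      current-frame features
    mem_keys: (B, T, C, H, W)   key features of T past frames
    mem_vals: (B, T, Cv, H, W)  value features (e.g. semantic features) of T past frames
    """
    B, T, C, H, W = mem_keys.shape
    Cv = mem_vals.shape[2]
    pad = window // 2
    out = torch.zeros(B, Cv, H, W, device=query.device)
    q = query.view(B, C, 1, H * W)
    for t in range(T):
        # Gather local neighborhoods of keys and values around every pixel.
        k = F.unfold(mem_keys[:, t], window, padding=pad).view(B, C, window * window, H * W)
        v = F.unfold(mem_vals[:, t], window, padding=pad).view(B, Cv, window * window, H * W)
        # Scaled dot-product similarity between each query pixel and its neighbors.
        attn = torch.softmax((q * k).sum(dim=1) / C ** 0.5, dim=1)  # (B, window*window, H*W)
        out += (v * attn.unsqueeze(1)).sum(dim=2).view(B, Cv, H, W)
    return out / T
```

The aggregated memory features would then be concatenated with the current-frame encoding before the segmentation head, which is what allows the same scheme to be bolted onto an existing single-frame network.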
In the second part, we shift our focus to the tasks of tracking and segmenting single objects in a video, aiming to bridge the gap between the two. We study how they are related and expose the benefits of working with segmentation masks in the context of VOT. To that end, we propose a segmentation-centric approach which, in contrast with most existing approaches, internally works with segmentation masks and predicts them without the need for an additional module. A dedicated instance localization branch, inspired by existing trackers, brings the robustness required by VOT challenges and conditions the segmentation decoder to predict the correct segmentation mask. We show that our unified architecture yields state-of-the-art results compared to other trackers, both in terms of robustness and accuracy, while generating accurate segmentation masks.
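As a rough picture of how a localization branch can condition a mask decoder, the skeleton below shows the data flow of such a segmentation-centric tracker. It is purely schematic: the class name SegCentricTracker, the single-convolution backbone, and the layer widths are placeholders and do not reflect the actual architecture of the thesis.

```python
import torch
import torch.nn as nn


class SegCentricTracker(nn.Module):
    """Schematic single-object tracker that works with masks end to end:
    a localization branch scores where the target is, and its output
    conditions the segmentation decoder."""

    def __init__(self, feat_ch: int = 256):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_ch, 3, stride=4, padding=1)  # stand-in encoder
        self.localizer = nn.Conv2d(feat_ch, 1, 3, padding=1)           # target score map
        self.decoder = nn.Sequential(                                  # mask decoder
            nn.Conv2d(feat_ch + 1, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, frame: torch.Tensor):
        feats = self.backbone(frame)
        score = self.localizer(feats)                       # robust instance localization
        cond = torch.cat([feats, torch.sigmoid(score)], dim=1)
        mask_logits = self.decoder(cond)                    # mask conditioned on localization
        return mask_logits, score
```

A real tracker would additionally encode the first-frame target annotation and fuse it with the test-frame features before localization; the point of the sketch is only that localization and mask prediction live in one network rather than in a tracker plus a separate segmentation module.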
Permanent link
https://doi.org/10.3929/ethz-b-000628365
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Van Gool, Luc
Examiner: Leibe, Bastian
Examiner: Felsberg, Michael
Examiner: Dragon, Ralf
Publisher
ETH Zurich
Subject
Computer Vision; Deep Learning; Semantic Segmentation; Visual Object Tracking; Video Object Segmentation
Organisational unit
03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)