Tracking Generic Objects in Videos

Mayer, Christoph

doi:10.3929/ethz-b-000626971


dc.contributor.author

Mayer, Christoph

dc.contributor.supervisor

Van Gool, Luc

dc.contributor.supervisor

Vedaldi, Andrea

dc.contributor.supervisor

Ling, Haibin

dc.contributor.supervisor

Danelljan, Martin

dc.date.accessioned

2023-08-17T12:52:20Z

dc.date.available

2023-08-16T18:39:48Z

dc.date.available

2023-08-17T11:12:00Z

dc.date.available

2023-08-17T12:27:05Z

dc.date.available

2023-08-17T12:52:20Z

dc.date.issued

2023

dc.identifier.uri

http://hdl.handle.net/20.500.11850/626971

dc.identifier.doi

10.3929/ethz-b-000626971

dc.description.abstract

Visual object tracking is a fundamental problem in computer vision with applications in autonomous driving, robotics, surveillance, video understanding, and sports analysis. Generic Object Tracking (GOT) is a specialized tracking task that aims at tracking virtually any object in a video, using a user-specified bounding box that defines the target object in the initial video frame. Learning a target model from such sparse information, in order to track the target in each frame, proves extremely challenging, especially in adverse tracking scenarios where the target object is frequently occluded, goes out of view, or where distractors (objects visually similar to the target) are present. In this thesis, we therefore tackle the problem of robust generic object tracking in videos, even in challenging scenarios. First, we propose a novel tracking architecture that keeps track of distractor objects in order to continue tracking the target. We achieve this by learning an association network that propagates the identities of all target candidates from frame to frame. To address the lack of ground-truth correspondences between distractor objects in visual tracking, we propose a training strategy that combines partial annotations with self-supervision. Second, we introduce a Transformer-based target model predictor that produces the target model. The employed Transformer captures global relations with little inductive bias, allowing it to learn to predict powerful target models even for challenging sequences. We further extend the model predictor to estimate a second set of weights, which are applied for accurate bounding box regression. Third, we propose a new visual tracking benchmark, AVisT, dedicated to tracking scenarios with adverse visibility. AVisT contains 18 diverse scenarios, broadly grouped into five attributes, with 42 object categories.
The key contribution of AVisT is its diverse and challenging scenarios, covering severe weather conditions, obstruction and adverse imaging effects, along with camouflage. Finally, we propose the task of multi-object GOT, which has wider applicability than tracking only a single generic object per video, making it more attractive for real-world applications. To this end, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows researchers to tackle remaining challenges in GOT, aiming to increase robustness and reduce computation through joint tracking of multiple objects simultaneously. Furthermore, we propose a Transformer-based GOT tracker capable of jointly processing multiple objects through shared computation.

en_US

dc.format

application/pdf

en_US

dc.language.iso

en

en_US

dc.publisher

ETH Zurich

en_US

dc.rights.uri

http://rightsstatements.org/page/InC-NC/1.0/

dc.subject

Computer Vision

en_US

dc.subject

Deep Learning

en_US

dc.subject

Machine Learning

en_US

dc.subject

Tracking

en_US

dc.title

Tracking Generic Objects in Videos

en_US

dc.type

Doctoral Thesis

dc.rights.license

In Copyright - Non-Commercial Use Permitted

dc.date.published

2023-08-17

ethz.size

175 p.

en_US

ethz.code.ddc

DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science

en_US

ethz.identifier.diss

29277

en_US

ethz.publication.place

Zurich

en_US

ethz.publication.status

published

en_US

ethz.leitzahl

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02652 - Institut für Bildverarbeitung / Computer Vision Laboratory::03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)

en_US

ethz.leitzahl.certified

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02652 - Institut für Bildverarbeitung / Computer Vision Laboratory::03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)

en_US

ethz.date.deposited

2023-08-16T18:39:49Z

ethz.source

FORM

ethz.eth

yes

en_US

ethz.availability

Open access

en_US

ethz.rosetta.installDate

2023-08-17T12:52:22Z

ethz.rosetta.lastUpdated

2024-02-03T02:35:08Z

ethz.rosetta.exportRequired

true

ethz.rosetta.versionExported

true

ethz.COinS

ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Tracking%20Generic%20Objects%20in%20Videos&rft.date=2023&rft.au=Mayer,%20Christoph&rft.genre=unknown&rft.btitle=Tracking%20Generic%20Objects%20in%20Videos


Files in this item

Name: phd_thesis_christoph_mayer.pdf
Size: 31.06Mb
Format: Adobe PDF
Label: Full text
