Tracking Generic Objects in Videos
dc.contributor.author
Mayer, Christoph
dc.contributor.supervisor
Van Gool, Luc
dc.contributor.supervisor
Vedaldi, Andrea
dc.contributor.supervisor
Ling, Haibin
dc.contributor.supervisor
Danelljan, Martin
dc.date.accessioned
2023-08-17T12:52:20Z
dc.date.available
2023-08-16T18:39:48Z
dc.date.available
2023-08-17T11:12:00Z
dc.date.available
2023-08-17T12:27:05Z
dc.date.available
2023-08-17T12:52:20Z
dc.date.issued
2023
dc.identifier.uri
http://hdl.handle.net/20.500.11850/626971
dc.identifier.doi
10.3929/ethz-b-000626971
dc.description.abstract
Visual object tracking is a fundamental problem in computer vision and finds application in multiple tasks such as autonomous driving, robotics, surveillance, video understanding, and sports analysis. Generic Object Tracking (GOT) is a specialized tracking task that aims at tracking virtually any object in a video by using a user-specified bounding box that defines the target object in the initial video frame. Learning a target model from such sparse information, in order to track the target in each frame, proves extremely challenging, especially in adverse tracking scenarios where the target object is frequently occluded, goes out of view, or where distractors, i.e., objects visually similar to the target, are present. In this thesis, we therefore tackle the problem of robust generic object tracking in videos, even in such challenging scenarios.
First, we propose a novel tracking architecture that keeps track of distractor objects in order to continue tracking the target. We achieve this by learning an association network that propagates the identities of all target candidates from frame to frame. To address the lack of ground-truth correspondences between distractor objects in visual tracking, we propose a training strategy that combines partial annotations with self-supervision.
Second, we introduce a Transformer-based model predictor that produces the target model. The employed Transformer captures global relations with little inductive bias, thus allowing it to learn the prediction of powerful target models even for challenging sequences. We further extend the model predictor to estimate a second set of weights, which are used for accurate bounding-box regression.
Third, we propose a new visual tracking benchmark, AVisT, dedicated to tracking scenarios with adverse visibility. AVisT contains 18 diverse scenarios, broadly grouped into five attributes, with 42 object categories. The key contribution of AVisT is its diverse and challenging scenarios, covering severe weather conditions, obstruction, adverse imaging effects, and camouflage.
Finally, we propose the task of multi-object GOT, which offers wider applicability than tracking only a single generic object per video, making it more attractive for real-world applications. To this end, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows researchers to tackle remaining challenges in GOT, aiming to increase robustness and reduce computation through the joint tracking of multiple objects. Furthermore, we propose a Transformer-based GOT tracker capable of jointly processing multiple objects through shared computation.
en_US
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.subject
Computer Vision
en_US
dc.subject
Deep Learning
en_US
dc.subject
Machine Learning
en_US
dc.subject
Tracking
en_US
dc.title
Tracking Generic Objects in Videos
en_US
dc.type
Doctoral Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2023-08-17
ethz.size
175 p.
en_US
ethz.code.ddc
DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science
en_US
ethz.identifier.diss
29277
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02652 - Institut für Bildverarbeitung / Computer Vision Laboratory::03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02652 - Institut für Bildverarbeitung / Computer Vision Laboratory::03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)
en_US
ethz.date.deposited
2023-08-16T18:39:49Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2023-08-17T12:52:22Z
ethz.rosetta.lastUpdated
2024-02-03T02:35:08Z
ethz.rosetta.exportRequired
true
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Tracking%20Generic%20Objects%20in%20Videos&rft.date=2023&rft.au=Mayer,%20Christoph&rft.genre=unknown&rft.btitle=Tracking%20Generic%20Objects%20in%20Videos