Domain-Robust Network Architectures and Training Strategies for Visual Scene Understanding

Hoyer, Lukas

doi:10.3929/ethz-b-000702004

dc.contributor.author

Hoyer, Lukas

dc.contributor.supervisor

Van Gool, Luc

dc.contributor.supervisor

Dai, Dengxin

dc.contributor.supervisor

Schiele, Bernt

dc.contributor.supervisor

Salzmann, Mathieu

dc.date.accessioned

2024-10-28T08:28:55Z

dc.date.available

2024-10-27T18:34:11Z

dc.date.available

2024-10-28T08:28:55Z

dc.date.issued

2024

dc.identifier.uri

http://hdl.handle.net/20.500.11850/702004

dc.identifier.doi

10.3929/ethz-b-000702004

dc.description.abstract

Understanding the content of images is an important part of many applications in autonomous driving, augmented reality, robotics, medical imaging, and remote sensing. With the breakthrough of deep neural networks, semantic image understanding has progressed substantially in the last few years. However, neural networks require large amounts of annotated data to be trained properly. As annotating large-scale real-world datasets is costly, a network can instead be trained on a dataset with existing or cheaper annotations, such as automatically labeled synthetic data. Unfortunately, neural networks are usually sensitive to domain shifts, so they perform rather poorly on domains that differ from the training data. Therefore, unsupervised domain adaptation (UDA) and domain generalization (DG) methods aim to enable a model trained on a source domain (e.g. synthetic data) to perform well on unlabeled or even unseen target domains (e.g. real-world data).
Most UDA/DG research has focused specifically on the design of adaptation and generalization techniques to overcome the problem of domain shifts, while the influence of other aspects of the learning framework on domain robustness has been mostly overlooked. Therefore, we take a more holistic view of domain robustness and study the impact of different aspects of the learning framework on UDA and DG, including the network architecture, general training schemes, image resolution, crop size, and context information. In particular, we address the following problems of existing DG and UDA methods:
(1) Instead of relying on generic and outdated segmentation architectures for evaluating DG/UDA strategies, we study the influence of recent architectures on domain-robust semantic/panoptic segmentation and design a network architecture specifically tailored for domain-generalizable and domain-adaptive segmentation.
(2) To avoid overfitting to the source domain, we propose general training strategies that preserve prior knowledge.
(3) To achieve fine segmentation details under the increased GPU memory consumption of DG/UDA, we propose a domain-robust and memory-efficient multi-resolution training framework.
(4) To resolve local appearance ambiguities on the target domain, we propose a method to enhance the learning of spatial context relations.
These contributions are detailed in the following paragraphs.
As previous UDA and DG semantic segmentation methods are mostly based on outdated DeepLabV2 networks with ResNet backbones, we benchmark more recent architectures, reveal the potential of Transformers, and design the DAFormer network architecture tailored for UDA and DG. It consists of a hierarchical Transformer encoder and a multi-level context-aware feature fusion decoder. The DAFormer network is enabled by three simple but crucial training strategies that stabilize the training and avoid overfitting to the source domain: Rare Class Sampling on the source domain improves the quality of the pseudo-labels by mitigating the confirmation bias of self-training toward common classes, while a Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pre-training. With these techniques, DAFormer achieves major performance advances in UDA and DG and enables learning even difficult classes such as train, bus, and truck.
Further, we study principal architecture designs for panoptic segmentation with respect to their UDA capabilities. We show that previous panoptic UDA methods made suboptimal design choices. Based on these findings, we propose EDAPS, a network architecture designed specifically for domain-adaptive panoptic segmentation. It uses a shared, domain-robust Transformer encoder to facilitate the joint adaptation of semantic and instance features, combined with task-specific decoders tailored to the requirements of domain-adaptive semantic and instance segmentation, respectively.
While DAFormer and EDAPS can better distinguish different classes, we observe that they lack fine segmentation details. We trace this to the use of downscaled images, which results in low-resolution predictions. However, naively using full high-resolution images is infeasible due to the higher GPU memory consumption of UDA/DG compared to supervised methods. The alternative of training with random crops of high-resolution images alleviates this problem but falls short in capturing long-range, domain-robust context information. Therefore, we propose HRDA, a multi-resolution training approach for UDA and DG that uses a learned scale attention to combine the strengths of small high-resolution crops, which preserve fine segmentation details, and large low-resolution crops, which capture long-range context dependencies, while maintaining a manageable GPU memory footprint. HRDA enables adapting small objects and preserving fine segmentation details, significantly improving the performance of previous UDA and DG methods.
Even with the improved discriminative and high-resolution abilities of DAFormer and HRDA, UDA methods struggle with classes that have a similar visual appearance on the target domain, as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module that enhances UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces consistency between the predictions for masked target images, in which random patches are withheld, and pseudo-labels generated from the complete image. To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across the different recognition tasks and domain gaps.
Overall, this thesis reveals the importance of a holistic view of the different aspects of the learning framework, such as network architectures and general training strategies, for domain-robust visual scene understanding. The presented methods substantially improve the performance on synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather domain adaptation across several perception tasks. For instance, they achieve an overall gain of +18.4 mIoU for semantic segmentation on GTA-to-Cityscapes. Beyond adaptation, DAFormer and HRDA even work in the more challenging domain generalization setting, where they improve the performance by +12.0 mIoU when generalizing from GTA to 5 unseen real-world datasets. The implementations are open-sourced and available at https://github.com/lhoyer.
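
As an illustration of the masking step that the abstract describes for MIC, the following is a minimal sketch in plain Python and not the thesis implementation. It assumes images and label maps are nested lists, and the names random_patch_mask, apply_mask, disagreement_rate, patch_size, and mask_ratio are hypothetical. The disagreement rate is only a simple stand-in for a consistency loss; the abstract does not specify the actual loss.

```python
import random


def random_patch_mask(height, width, patch_size, mask_ratio, rng=None):
    """Build a height x width grid of 0/1 values, where 0 marks a withheld pixel.

    Every patch_size x patch_size patch is withheld independently with
    probability mask_ratio (hypothetical parameter names).
    """
    if rng is None:
        rng = random.Random()
    rows = -(-height // patch_size)  # ceiling division
    cols = -(-width // patch_size)
    # One keep/drop decision per patch.
    keep = [[0 if rng.random() < mask_ratio else 1 for _ in range(cols)]
            for _ in range(rows)]
    # Expand the patch-level decisions to pixel resolution.
    return [[keep[y // patch_size][x // patch_size] for x in range(width)]
            for y in range(height)]


def apply_mask(image, mask):
    """Zero out the withheld pixels of a single-channel image."""
    return [[pixel * m for pixel, m in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]


def disagreement_rate(masked_prediction, pseudo_label):
    """Fraction of pixels where the prediction for the masked image differs
    from the pseudo-label obtained from the complete image (a stand-in for
    a consistency loss)."""
    pairs = [(p, q)
             for pred_row, label_row in zip(masked_prediction, pseudo_label)
             for p, q in zip(pred_row, label_row)]
    return sum(p != q for p, q in pairs) / len(pairs)
```

In a training loop, the mask would be applied to an unlabeled target image, the network would predict on the masked image, and the result would be compared against pseudo-labels predicted from the unmasked image.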

en_US

dc.format

application/pdf

en_US

dc.language.iso

en

en_US

dc.publisher

ETH Zurich

en_US

dc.rights.uri

http://rightsstatements.org/page/InC-NC/1.0/

dc.title

Domain-Robust Network Architectures and Training Strategies for Visual Scene Understanding

en_US

dc.type

Doctoral Thesis

dc.rights.license

In Copyright - Non-Commercial Use Permitted

dc.date.published

2024-10-28

ethz.size

214 p.

en_US

ethz.code.ddc

DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science

en_US

ethz.identifier.diss

30318

en_US

ethz.publication.place

Zurich

en_US

ethz.publication.status

published

en_US

ethz.leitzahl

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02652 - Institut für Bildverarbeitung / Computer Vision Laboratory::03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)

en_US

ethz.date.deposited

2024-10-27T18:34:11Z

ethz.source

FORM

ethz.eth

yes

en_US

ethz.availability

Open access

en_US

ethz.rosetta.installDate

2024-10-28T08:28:57Z

ethz.rosetta.lastUpdated

2024-10-28T08:28:57Z

ethz.rosetta.versionExported

true

ethz.COinS

ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Domain-Robust%20Network%20Architectures%20and%20Training%20Strategies%20for%20Visual%20Scene%20Understanding&rft.date=2024&rft.au=Hoyer,%20Lukas&rft.genre=unknown&rft.btitle=Domain-Robust%20Network%20Architectures%20and%20Training%20Strategies%20for%20Visual%20Scene%20Understanding

Files in this item

Name:: PhD_Thesis_Lukas_Hoyer_Digital.pdf
Size:: 41.60Mb
Format:: Adobe PDF
Label:: Full text
