Multimodal Representation Learning under Weak Supervision
dc.contributor.author: Daunhawer, Imant
dc.contributor.supervisor: Vogt, Julia E.
dc.contributor.supervisor: Roth, Volker
dc.contributor.supervisor: Borgwardt, Karsten M.
dc.date.accessioned: 2024-01-12T07:02:56Z
dc.date.available: 2024-01-10T19:33:34Z
dc.date.available: 2024-01-11T21:06:58Z
dc.date.available: 2024-01-12T07:02:56Z
dc.date.issued: 2023
dc.identifier.uri: http://hdl.handle.net/20.500.11850/651807
dc.identifier.doi: 10.3929/ethz-b-000651807
dc.description.abstract:
Biological organisms experience a world of multiple modalities through a variety of sensory systems. For example, they may perceive physical or chemical stimuli through the senses of sight, smell, taste, touch, and hearing. Across species, the nervous system integrates heterogeneous sensory stimuli and forms multimodal representations that capture information shared between modalities. Analogously, machines can perceive their environment through different types of sensors, such as cameras and microphones. Yet, it is not sufficiently well understood how multimodal representations can be formed in silico, i.e., via computer simulation. In this thesis, we study how to leverage statistical dependencies between modalities to form multimodal representations computationally using machine learning.
We start from the premise that real-world data is generated from a few factors of variation. Given a set of observations, representation learning seeks to infer these latent variables, which is fundamentally impossible without further assumptions. However, when we have corresponding observations of different modalities, statistical dependencies between them can carry meaningful information about the latent structure of the underlying process. Motivated by this idea, we study multimodal learning under weak supervision, which means that we consider corresponding observations of multiple modalities without labels for what is shared between them. For this challenging setup, we design machine learning algorithms that transform observations into representations of shared and modality-specific information without explicit supervision by labels. Thus, we develop methods that infer latent structure from low-level observations using weak supervision in the form of multiple modalities.
We develop techniques for multimodal representation learning using two approaches: generative and discriminative learning. First, we focus on generative learning with variational autoencoders (VAEs) and propose a principled and scalable method for variational inference and density estimation on sets of modalities. Our method enhances the encoding and disentanglement of shared and modality-specific information and consequently improves the generative performance compared to relevant baselines. Motivated by these results, we consider an explicit partitioning of the latent space into shared and modality-specific subspaces. We explore the benefits and pitfalls of partitioning and develop a model that promotes the desired disentanglement for the respective subspaces. Thereby, it further improves the generative performance compared to models with a joint latent space. However, we also establish fundamental limitations for generative learning with multimodal VAEs. We show that the sub-sampling of modalities enforces an undesirable bound on the approximation of the joint distribution. This limits the generative performance of mixture-based multimodal VAEs and constrains their application to settings where relevant information can be predicted in expectation across modalities at the level of observations.
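To make the mixture-based formulation concrete, the following is a minimal PyTorch-style sketch of a multimodal VAE whose joint posterior is a mixture of the unimodal posteriors: each term of the objective sub-samples a single modality for encoding and reconstructs all modalities from the resulting latent code. The module names, architectures, and Gaussian/MSE likelihoods are illustrative assumptions, not the models developed in the thesis.

```python
# Illustrative sketch of a mixture-based multimodal VAE objective.
# Each term encodes one sub-sampled modality and reconstructs all modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(i, o):
    return nn.Sequential(nn.Linear(i, 256), nn.ReLU(), nn.Linear(256, o))

class UnimodalVAEPart(nn.Module):
    """Encoder/decoder pair for one modality with a diagonal Gaussian posterior."""
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.enc = mlp(x_dim, 2 * z_dim)   # outputs mean and log-variance
        self.dec = mlp(z_dim, x_dim)

    def posterior(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu, logvar

def reparameterize(mu, logvar):
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_to_standard_normal(mu, logvar):
    # KL(q || N(0, I)) per sample, summed over latent dimensions.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

def mixture_elbo(xs, parts):
    """Average over modalities m of E_{q(z|x_m)}[sum_k log p(x_k|z)] - KL,
    with Gaussian likelihoods approximated up to constants by negative MSE."""
    total = 0.0
    for m, part in enumerate(parts):
        mu, logvar = part.posterior(xs[m])
        z = reparameterize(mu, logvar)      # sub-sample: encode modality m only...
        rec = sum(                          # ...but reconstruct every modality
            -F.mse_loss(p.dec(z), xs[k], reduction="none").sum(-1)
            for k, p in enumerate(parts)
        )
        total = total + (rec - kl_to_standard_normal(mu, logvar))
    return (total / len(parts)).mean()

# Usage: two toy modalities with 32 and 48 features, shared 16-dim latent space.
parts = nn.ModuleList([UnimodalVAEPart(32, 16), UnimodalVAEPart(48, 16)])
xs = [torch.randn(8, 32), torch.randn(8, 48)]
loss = -mixture_elbo(xs, parts)   # minimize the negative ELBO-style objective
loss.backward()
```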
To address these issues, we shift to discriminative approaches and focus on contrastive learning. We show that contrastive learning can be used to identify shared latent factors that are invariant across modalities up to a block-wise indeterminacy, even in the presence of non-trivial statistical and causal dependencies between latent variables. Finally, we demonstrate how the representations produced by contrastive learning can be used to transcend the limitations of multimodal VAEs, which yields a hybrid approach for multimodal generative learning and the disentanglement of shared and modality-specific information. Thus, we establish a theoretical basis for multimodal representation learning and explain in which settings generative and discriminative approaches can be effective in practice.
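As a rough illustration of the contrastive approach, the sketch below pairs two modality-specific encoders with a symmetric InfoNCE objective, so that matching cross-modal pairs are mapped close together in a shared embedding space while non-matching pairs are pushed apart; under suitable assumptions, such objectives recover the shared latent factors only up to a block-wise indeterminacy. The encoder architectures, dimensions, and temperature are placeholder choices, not the thesis's implementation.

```python
# Minimal sketch of multimodal contrastive learning with a symmetric InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm embeddings

def symmetric_info_nce(z1, z2, temperature=0.1):
    """Cross-modal InfoNCE: matching pairs (i, i) are positives, the rest negatives."""
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: paired mini-batches from two modalities, e.g. image and audio features.
enc1, enc2 = ModalityEncoder(64, 16), ModalityEncoder(128, 16)
x1, x2 = torch.randn(32, 64), torch.randn(32, 128)
loss = symmetric_info_nce(enc1(x1), enc2(x2))
loss.backward()
```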
dc.format: application/pdf
dc.language.iso: en
dc.publisher: ETH Zurich
dc.rights.uri: http://rightsstatements.org/page/InC-NC/1.0/
dc.subject: Machine Learning; Computer Science
dc.title: Multimodal Representation Learning under Weak Supervision
dc.type: Doctoral Thesis
dc.rights.license: In Copyright - Non-Commercial Use Permitted
dc.date.published: 2024-01-12
ethz.size: 204 p.
ethz.code.ddc: DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science
ethz.identifier.diss: 29913
ethz.publication.place: Zurich
ethz.publication.status: published
ethz.leitzahl: ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09670 - Vogt, Julia / Vogt, Julia
ethz.date.deposited: 2024-01-10T19:33:34Z
ethz.source: FORM
ethz.eth: yes
ethz.availability: Open access
ethz.rosetta.installDate: 2024-01-12T07:03:00Z
ethz.rosetta.lastUpdated: 2024-01-12T07:03:00Z
ethz.rosetta.versionExported: true