Open access
Author
Date
2023
Type
Doctoral Thesis
ETH Bibliographie
yes
Abstract
Biological organisms experience a world of multiple modalities through a variety of sensory systems. For example, they may perceive physical or chemical stimuli through the senses of sight, smell, taste, touch, and hearing. Across species, the nervous system integrates heterogeneous sensory stimuli and forms multimodal representations that capture information shared between modalities. Analogously, machines can perceive their environment through different types of sensors, such as cameras and microphones. Yet, it is not sufficiently well understood how multimodal representations can be formed in silico, i.e., via computer simulation. In this thesis, we study how to leverage statistical dependencies between modalities to form multimodal representations computationally using machine learning.
We start from the premise that real-world data is generated from a few factors of variation. Given a set of observations, representation learning seeks to infer these latent variables, which is fundamentally impossible without further assumptions. However, when we have corresponding observations of different modalities, statistical dependencies between them can carry meaningful information about the latent structure of the underlying process. Motivated by this idea, we study multimodal learning under weak supervision, which means that we consider corresponding observations of multiple modalities without labels for what is shared between them. For this challenging setup, we design machine learning algorithms that transform observations into representations of shared and modality-specific information without explicit supervision by labels. Thus, we develop methods that infer latent structure from low-level observations using weak supervision in the form of multiple modalities.
We develop techniques for multimodal representation learning using two approaches—generative and discriminative learning. First, we focus on generative learning with variational autoencoders (VAEs) and propose a principled and scalable method for variational inference and density estimation on sets of modalities. Our method enhances the encoding and disentanglement of shared and modality-specific information and consequently improves the generative performance compared to relevant baselines. Motivated by these results, we consider an explicit partitioning of the latent space into shared and modality-specific subspaces. We explore the benefits and pitfalls of partitioning and develop a model that promotes the desired disentanglement for the respective subspaces. Thereby, it further improves the generative performance compared to models with a joint latent space. On the other hand, we also establish fundamental limitations for generative learning with multimodal VAEs. We show that the sub-sampling of modalities enforces an undesirable bound on the approximation of the joint distribution. This limits the generative performance of mixture-based multimodal VAEs and constrains their application to settings where relevant information can be predicted in expectation across modalities on the level of observations.
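The abstract above contrasts joint-latent-space models with mixture-based multimodal VAEs, whose sub-sampling of modalities limits the approximation of the joint posterior. As an illustration only (the thesis's own architectures differ in detail, and all function names here are hypothetical), the two standard ways of aggregating per-modality Gaussian posteriors into a joint posterior can be sketched in a few lines:

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Product-of-experts joint posterior: combine per-modality Gaussian
    posteriors q(z|x_m) by precision-weighting their means (closed form)."""
    prec = np.exp(-np.asarray(logvars))                 # 1 / sigma_m^2
    joint_var = 1.0 / prec.sum(axis=0)                  # joint variance
    joint_mu = joint_var * (np.asarray(mus) * prec).sum(axis=0)
    return joint_mu, np.log(joint_var)

def mixture_of_experts_sample(mus, logvars, rng):
    """Mixture-based joint posterior: uniformly sub-sample one modality and
    draw z from its unimodal posterior -- the sub-sampling step whose bound
    on the joint approximation the abstract refers to."""
    m = rng.integers(len(mus))
    std = np.exp(0.5 * np.asarray(logvars[m]))
    return np.asarray(mus[m]) + std * rng.standard_normal(np.shape(mus[m]))

# Two equally confident modalities place the product-of-experts mean
# halfway between their unimodal means, with halved variance.
mu, logvar = product_of_experts(mus=[np.array([0.0]), np.array([2.0])],
                                logvars=[np.array([0.0]), np.array([0.0])])
```

The product rule concentrates the joint posterior as modalities are added, whereas the mixture rule only ever samples from one unimodal posterior at a time, which is the structural reason for the bound discussed above.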
To address these issues, we shift to discriminative approaches and focus on contrastive learning. We show that contrastive learning can be used to identify shared latent factors that are invariant across modalities up to a block-wise indeterminacy, even in the presence of non-trivial statistical and causal dependencies between latent variables. Finally, we demonstrate how the representations produced by contrastive learning can be used to transcend the limitations of multimodal VAEs, which yields a hybrid approach for multimodal generative learning and the disentanglement of shared and modality-specific information. Thus, we establish a theoretical basis for multimodal representation learning and explain in which settings generative and discriminative approaches can be effective in practice.
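The contrastive objective underlying this line of work treats corresponding observations from two modalities as positive pairs and all other pairings in a batch as negatives. A minimal numpy sketch of the symmetric InfoNCE loss (an illustrative stand-in, not the thesis's exact objective; the `temperature` parameter and function name are assumptions) looks like this:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss over a batch of paired embeddings: (z1[i], z2[i]) are
    positives from two modalities; all cross-pairs serve as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # unit-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (N, N) cosine sims
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # -log p(positive)
```

Minimizing this loss aligns the encoders on exactly the information shared between modalities, which is why (under the assumptions analyzed in the thesis) it recovers the shared latent factors up to a block-wise indeterminacy.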
Persistent Link
https://doi.org/10.3929/ethz-b-000651807
Publication Status
published
External Links
Search for a print copy via the ETH Library
Publisher
ETH Zurich
Subject
Machine Learning; Computer Science
Organisational Unit
09670 - Vogt, Julia / Vogt, Julia