Multimodal Representation Learning under Weak Supervision
dc.contributor.author: Daunhawer, Imant
dc.contributor.supervisor: Vogt, Julia E.
dc.contributor.supervisor: Roth, Volker
dc.contributor.supervisor: Borgwardt, Karsten M.
dc.date.accessioned: 2024-01-12T07:02:56Z
dc.date.available: 2024-01-10T19:33:34Z
dc.date.available: 2024-01-11T21:06:58Z
dc.date.available: 2024-01-12T07:02:56Z
dc.date.issued: 2023
dc.identifier.uri: http://hdl.handle.net/20.500.11850/651807
dc.identifier.doi: 10.3929/ethz-b-000651807
dc.description.abstract:
Biological organisms experience a world of multiple modalities through a variety of sensory systems. For example, they may perceive physical or chemical stimuli through the senses of sight, smell, taste, touch, and hearing. Across species, the nervous system integrates heterogeneous sensory stimuli and forms multimodal representations that capture information shared between modalities. Analogously, machines can perceive their environment through different types of sensors, such as cameras and microphones. Yet, it is not sufficiently well understood how multimodal representations can be formed in silico, i.e., via computer simulation. In this thesis, we study how to leverage statistical dependencies between modalities to form multimodal representations computationally using machine learning.
We start from the premise that real-world data is generated from a few factors of variation. Given a set of observations, representation learning seeks to infer these latent variables, which is fundamentally impossible without further assumptions. However, when we have corresponding observations of different modalities, statistical dependencies between them can carry meaningful information about the latent structure of the underlying process. Motivated by this idea, we study multimodal learning under weak supervision, which means that we consider corresponding observations of multiple modalities without labels for what is shared between them. For this challenging setup, we design machine learning algorithms that transform observations into representations of shared and modality-specific information without explicit supervision by labels. Thus, we develop methods that infer latent structure from low-level observations using weak supervision in the form of multiple modalities.
We develop techniques for multimodal representation learning using two approaches: generative and discriminative learning. First, we focus on generative learning with variational autoencoders (VAEs) and propose a principled and scalable method for variational inference and density estimation on sets of modalities. Our method enhances the encoding and disentanglement of shared and modality-specific information and consequently improves the generative performance compared to relevant baselines. Motivated by these results, we consider an explicit partitioning of the latent space into shared and modality-specific subspaces. We explore the benefits and pitfalls of partitioning and develop a model that promotes the desired disentanglement for the respective subspaces. Thereby, it further improves the generative performance compared to models with a joint latent space. However, we also establish fundamental limitations for generative learning with multimodal VAEs. We show that the sub-sampling of modalities enforces an undesirable bound on the approximation of the joint distribution. This limits the generative performance of mixture-based multimodal VAEs and constrains their application to settings where relevant information can be predicted in expectation across modalities at the level of observations.
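To make the mixture-based formulation concrete, the following is a minimal PyTorch-style sketch of a multimodal VAE whose joint posterior is a mixture of the unimodal posteriors: each term of the objective sub-samples a single modality for encoding and reconstructs all modalities from the resulting latent code. The module names, architectures, and Gaussian/MSE likelihoods are illustrative assumptions, not the models developed in the thesis.

```python
# Illustrative sketch of a mixture-based multimodal VAE objective.
# Each term encodes one sub-sampled modality and reconstructs all modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(i, o):
    return nn.Sequential(nn.Linear(i, 256), nn.ReLU(), nn.Linear(256, o))

class UnimodalVAEPart(nn.Module):
    """Encoder/decoder pair for one modality with a diagonal Gaussian posterior."""
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.enc = mlp(x_dim, 2 * z_dim)   # outputs mean and log-variance
        self.dec = mlp(z_dim, x_dim)

    def posterior(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu, logvar

def reparameterize(mu, logvar):
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_to_standard_normal(mu, logvar):
    # KL(q || N(0, I)) per sample, summed over latent dimensions.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

def mixture_elbo(xs, parts):
    """Average over modalities m of E_{q(z|x_m)}[sum_k log p(x_k|z)] - KL,
    with Gaussian likelihoods approximated up to constants by negative MSE."""
    total = 0.0
    for m, part in enumerate(parts):
        mu, logvar = part.posterior(xs[m])
        z = reparameterize(mu, logvar)      # sub-sample: encode modality m only...
        rec = sum(                          # ...but reconstruct every modality
            -F.mse_loss(p.dec(z), xs[k], reduction="none").sum(-1)
            for k, p in enumerate(parts)
        )
        total = total + (rec - kl_to_standard_normal(mu, logvar))
    return (total / len(parts)).mean()

# Usage: two toy modalities with 32 and 48 features, shared 16-dim latent space.
parts = nn.ModuleList([UnimodalVAEPart(32, 16), UnimodalVAEPart(48, 16)])
xs = [torch.randn(8, 32), torch.randn(8, 48)]
loss = -mixture_elbo(xs, parts)   # minimize the negative ELBO-style objective
loss.backward()
```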
To address these issues, we shift to discriminative approaches and focus on contrastive learning. We show that contrastive learning can be used to identify shared latent factors that are invariant across modalities up to a block-wise indeterminacy, even in the presence of non-trivial statistical and causal dependencies between latent variables. Finally, we demonstrate how the representations produced by contrastive learning can be used to transcend the limitations of multimodal VAEs, which yields a hybrid approach for multimodal generative learning and the disentanglement of shared and modality-specific information. Thus, we establish a theoretical basis for multimodal representation learning and explain in which settings generative and discriminative approaches can be effective in practice.
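As a rough illustration of the contrastive approach, the sketch below pairs two modality-specific encoders with a symmetric InfoNCE objective, so that matching cross-modal pairs are mapped close together in a shared embedding space while non-matching pairs are pushed apart; under suitable assumptions, such objectives recover the shared latent factors only up to a block-wise indeterminacy. The encoder architectures, dimensions, and temperature are placeholder choices, not the thesis's implementation.

```python
# Minimal sketch of multimodal contrastive learning with a symmetric InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm embeddings

def symmetric_info_nce(z1, z2, temperature=0.1):
    """Cross-modal InfoNCE: matching pairs (i, i) are positives, the rest negatives."""
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: paired mini-batches from two modalities, e.g. image and audio features.
enc1, enc2 = ModalityEncoder(64, 16), ModalityEncoder(128, 16)
x1, x2 = torch.randn(32, 64), torch.randn(32, 128)
loss = symmetric_info_nce(enc1(x1), enc2(x2))
loss.backward()
```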
dc.format: application/pdf
dc.language.iso: en
dc.publisher: ETH Zurich
dc.rights.uri: http://rightsstatements.org/page/InC-NC/1.0/
dc.subject: Machine Learning; Computer Science
dc.title: Multimodal Representation Learning under Weak Supervision
dc.type: Doctoral Thesis
dc.rights.license: In Copyright - Non-Commercial Use Permitted
dc.date.published: 2024-01-12
ethz.size: 204 p.
ethz.code.ddc: DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science
ethz.identifier.diss: 29913
ethz.publication.place: Zurich
ethz.publication.status: published
ethz.leitzahl: ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09670 - Vogt, Julia / Vogt, Julia
ethz.date.deposited: 2024-01-10T19:33:34Z
ethz.source: FORM
ethz.eth: yes
ethz.availability: Open access
ethz.rosetta.installDate: 2024-01-12T07:03:00Z
ethz.rosetta.lastUpdated: 2024-01-12T07:03:00Z
ethz.rosetta.versionExported: true