Open access
Author
Date: 2023
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Humans naturally integrate multiple senses to understand their surroundings, enabling them to compensate for partially missing sensory input. Machine learning models, in contrast, excel at harnessing extensive datasets but struggle to handle missing data effectively.
While utilizing multiple data types provides a more comprehensive perspective, it also raises the likelihood of encountering missing values, underscoring the importance of proper missing-data handling in machine learning methods.
In this thesis, we advocate for machine learning models that emulate the human approach of merging diverse sensory inputs into a unified representation and remain resilient to missing input sources. Generating labels for multiple data types is laborious and often costly, so fully annotated multimodal datasets are scarce. On the other hand, multimodal data naturally carries a form of weak supervision: we know that the samples in a group describe the same event and assume that certain underlying generative factors are shared among the group members.
Our thesis focuses on learning from data characterized by weak supervision, delving into the interrelationships among group members.
We start by exploring novel techniques for machine learning models capable of processing multimodal inputs while effectively handling missing data. Our emphasis is on variational autoencoders (VAEs) for learning from weakly supervised data. We introduce a generalized formulation of probabilistic aggregation functions, designed to overcome the limitations of previous methods, and show that this generalization translates into performance gains.
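To illustrate what a probabilistic aggregation function does here, the following is a minimal product-of-experts sketch — one common aggregator for multimodal VAEs, not the thesis's generalized formulation — that combines the unimodal Gaussian posteriors of whichever modalities happen to be observed:

```python
import numpy as np

def poe_aggregate(mus, logvars):
    """Precision-weighted product of Gaussian experts.

    `mus`/`logvars` hold the unimodal posterior parameters of the
    modalities that are actually observed; absent modalities are
    simply left out of the lists, so the aggregate is defined for
    any subset of inputs. A standard-normal prior expert keeps the
    product well-defined even with a single modality.
    """
    mus = [np.zeros_like(mus[0])] + list(mus)
    logvars = [np.zeros_like(logvars[0])] + list(logvars)
    precisions = [np.exp(-lv) for lv in logvars]
    var = 1.0 / np.sum(precisions, axis=0)
    mu = var * np.sum([p * m for p, m in zip(precisions, mus)], axis=0)
    return mu, np.log(var)

# Two observed modalities agree on the mean; the aggregate is pulled
# toward them but tempered by the prior expert.
mu, logvar = poe_aggregate([np.ones(2), np.ones(2)],
                           [np.zeros(2), np.zeros(2)])
```

Because each expert enters through its precision, a confident modality dominates the aggregate, and dropping a modality only removes one factor from the product rather than breaking the computation.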
At a higher level, we investigate the impact of implicit assumptions regarding group structure on a model's learning behavior and efficacy.
We find that the assumption of a single shared latent space is overly restrictive for generating coherent and high-quality samples. To overcome this limitation, we introduce modality-specific latent subspaces within multimodal VAEs, reflecting a more flexible modeling approach.
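A minimal sketch of this modeling choice, with hypothetical dimensions and linear decoders standing in for the per-modality decoder networks: each modality is reconstructed from the shared code concatenated with its own private code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical subspace sizes for illustration.
D_SHARED, D_SPECIFIC, D_OUT = 4, 2, 8

# One shared latent code, plus a private code per modality.
z_shared = rng.normal(size=D_SHARED)
z_private = {m: rng.normal(size=D_SPECIFIC) for m in ("image", "text")}

# Linear maps stand in for the per-modality decoder networks.
W = {m: rng.normal(size=(D_OUT, D_SHARED + D_SPECIFIC))
     for m in z_private}

def decode(modality):
    """Decode one modality from [shared code, its private code]."""
    z = np.concatenate([z_shared, z_private[modality]])
    return W[modality] @ z

x_image = decode("image")
x_text = decode("text")
```

Shared factors influence every modality through `z_shared`, while each `z_private[m]` can capture variation that only its own modality exhibits.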
While we observe that greater flexibility in modeling assumptions, or assumptions aligned with the actual data-generating process, leads to improved performance, we still depend on prior knowledge about the relationship within a group of multimodal or weakly supervised samples. As the number of group members grows, their underlying relationships become potentially more intricate, increasing the risk of overly rigid assumptions.
Therefore, in the final section, we shift our focus to minimizing the assumptions required when learning from weakly supervised data while simultaneously inferring the group structure during the learning process. In this context, we introduce a novel differentiable formulation of a random partition model, which follows a two-stage process. In the first stage, we estimate the number of elements per subset using a newly proposed differentiable formulation of the hypergeometric distribution. In the second stage, we allocate the appropriate number of elements to each subset. We demonstrate that our differentiable random partition model can learn shared and independent generative factors in the weakly supervised setting.
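The two-stage process can be sketched with hard, non-differentiable stand-ins — NumPy's multivariate hypergeometric sampler and a plain sort replace the thesis's differentiable relaxations, and the equal urn counts are an illustrative assumption:

```python
import numpy as np

def sample_subset_sizes(n_elements, n_subsets, rng):
    """Stage 1: draw subset sizes summing to n_elements from a
    multivariate hypergeometric distribution (hard stand-in for the
    differentiable formulation; equal urn counts are illustrative)."""
    return rng.multivariate_hypergeometric([n_elements] * n_subsets,
                                           n_elements)

def allocate(scores, sizes):
    """Stage 2: order elements by score and fill each subset with
    the next consecutive block of the ordering."""
    order = np.argsort(-scores)
    bounds = np.cumsum(sizes)
    return [order[b - s:b] for s, b in zip(sizes, bounds)]

rng = np.random.default_rng(0)
sizes = sample_subset_sizes(5, 3, rng)         # counts sum to 5
subsets = allocate(rng.normal(size=5), sizes)  # disjoint index blocks
```

Every element lands in exactly one subset, and both the subset sizes and the assignment are driven by quantities (distribution parameters, scores) that the differentiable version lets gradients flow through.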
We hope that this thesis and its contributions will benefit future applications in multimodal machine learning and reduce the assumptions necessary for learning from weakly supervised data in general.
Permanent link: https://doi.org/10.3929/ethz-b-000634822
Publication status: published
External links: Search print copy at ETH Library
Publisher: ETH Zurich
Subject: Machine Learning; Computer Science
Organisational unit: 09670 - Vogt, Julia / Vogt, Julia