Representation Learning for Dimensionality Reduction, Irregularly-Sampled Sequences and Graphs
Open access
Author
Date
2023
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Machine learning has the potential to revolutionize the fields of biology and healthcare by providing new tools that help scientists conduct research and help clinicians decide on the right treatment for patients. However, while recent approaches in representation learning give the impression of being universal black-box solutions to all problems, research has shown that this is not generally true. Even though models can perform well in a black-box fashion, they often generalize poorly and are sensitive to distribution shifts. This highlights the need for approaches that are informed by their downstream application and tailored to incorporate the symmetries of the problem into the model architecture. These inductive biases are essential for performance on new data and for models to remain robust even when the data distribution changes. Nevertheless, constructing good models is only half of the solution. To ensure that models translate well into clinical applications, they also need to be evaluated appropriately with this goal in mind.
In this thesis, I address the above points while taking a detailed look at structured data types present at the intersection of biology, medicine, and machine learning. In terms of algorithmic contributions, I first present a new non-linear dimensionality reduction algorithm that aims to preserve multi-scale relations. The cost reduction of genome sequencing and the ability to sequence individual cells have led to an exponentially growing volume of high-dimensional data in the life sciences. Such data cannot be understood intuitively, which makes dimensionality reduction approaches that can capture the nested relationships present in biology essential. Second, I develop methods for clinical applications where irregularly-sampled data are present. Conventional machine learning models require either the conversion of such data into fixed-size representations or the imputation of missing values prior to their application. I present two approaches tailored for irregularly-sampled data that do not require such preprocessing steps. The first is a new kernel for peaks derived from MALDI-TOF spectra, whereas the second is a deep learning model that can be applied to irregularly-sampled time series by phrasing them as sets of observations. Third, I present an extension to graph neural networks that allows models to account for global information instead of requiring nodes to exchange information only with their neighbors. Graphs are an important data structure for pharmacology, as they are often used to represent small molecules.
To address the appropriate evaluation of such models, I present a detailed study of medical time series models with a focus on their capability to transfer to other datasets in the context of a sepsis early prediction task. Further, I show that the conventional approach for the evaluation of graph generative models is highly sensitive to the selection of hyperparameters, which can lead to biased performance estimates.
In summary, my thesis addresses many problems at the intersection of machine learning, healthcare, and biology. It demonstrates how models can be improved by including more (domain-specific) knowledge and where to pay attention when evaluating said models.
Permanent link
https://doi.org/10.3929/ethz-b-000602440
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Machine Learning; Dimensionality reduction; Time Series; Graphs; Healthcare
Organisational unit
09486 - Borgwardt, Karsten M. (ehemalig) / Borgwardt, Karsten M. (former)