Show simple item record

dc.contributor.author
Velten, Britta
dc.contributor.supervisor
Bühlmann, Peter
dc.contributor.supervisor
Huber, Wolfgang
dc.contributor.supervisor
Stegle, Oliver
dc.date.accessioned
2019-03-25T15:25:52Z
dc.date.available
2019-03-25T09:42:06Z
dc.date.available
2019-03-25T09:59:07Z
dc.date.available
2019-03-25T15:25:52Z
dc.date.issued
2019
dc.identifier.uri
http://hdl.handle.net/20.500.11850/333437
dc.identifier.doi
10.3929/ethz-b-000333437
dc.description.abstract
Technological advances have transformed the scientific landscape by enabling comprehensive quantitative measurements, thereby increasingly facilitating data-driven research. This includes genome biology, where many data sets nowadays comprise a collection of heterogeneous high-dimensional data modalities, collected from different assays, tissues, organisms, time points or conditions. An important example is multi-omics data, i.e. data combining measurements from multiple biological layers. Jointly, such data promise to provide a better and more comprehensive understanding of biological processes and complex traits. A critical step towards realizing these promises is the development of statistical and computational methods that facilitate moving from the data to sound conclusions and biological insights. For this purpose, an integrative analysis that combines information from different data modalities is essential. In this thesis, we propose novel methods that provide a multivariate approach to data integration, and we apply them in the context of multi-omics studies in precision medicine and single cell biology. Given a collection of different data modalities on a set of samples, we aim to address two main questions: First, how can we obtain an (unbiased) overview of the main structures that are present in the data, both within and across data modalities? And second, how can we use all data to predict a response of interest and identify relevant features, whilst taking the heterogeneity of the features into account? The first question is important in all exploratory data analysis and leads us to unsupervised methods for data integration. Finding hidden structures in the data can give important insights into biological and technical sources of variation and yield an informative low-dimensional data representation. To this end, we introduce multi-table methods and latent factor models that can capture the main axes of variation and co-variation in the data.
Based on this, we present a novel factor method, multi-omics factor analysis (MOFA), to integrate information from different data modalities. Through sparsity assumptions on the factor loadings, MOFA decomposes variation into axes present in all, some, or single modalities and promotes interpretable factors with a direct link to molecular drivers. MOFA combines a statistical model that accommodates different data types and missing data with a scalable inference algorithm, thereby ensuring broad applicability. Once learnt, the factors enable a range of downstream analyses, including identification of sample subgroups, outlier detection and data imputation. We demonstrate the flexibility of MOFA and its potential to generate biological insight by applying it to a multi-omics study on chronic lymphocytic leukaemia as well as a multi-omics single cell data set. The second question leads us to supervised methods that enable building predictive models and selecting features relevant for a response of interest. Reliable methods for this purpose would have far-reaching consequences in many applications. For example, it would be extremely useful for decisions in clinical care if treatment outcome or disease progression could be predicted from available molecular or clinical data. Furthermore, the identification of important molecular markers could give insights into underlying biological mechanisms and eventually open up new treatment options. For this purpose, we turn to penalized regression methods and, building on these, develop a method that takes into account additional information on the features to adapt the relative strength of penalization in a data-driven manner. Such additional information in the form of external covariates is available in many applications and can, for example, encode structural knowledge on the data, e.g. different assay types, or provide information on a feature's variance, frequency or signal-to-noise ratio.
We show that incorporating informative covariates can improve prediction performance in penalized regression, and we investigate the use of important covariates in genome biology such as the omics or tissue type.
en_US
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.subject
data integration
en_US
dc.subject
genome biology
en_US
dc.subject
multivariate methods
en_US
dc.subject
penalised regression
en_US
dc.subject
structured regularisation
en_US
dc.subject
latent variable model
en_US
dc.subject
factor analysis
en_US
dc.subject
variational Bayes
en_US
dc.subject
dimensionality reduction
en_US
dc.subject
multi-omics
en_US
dc.subject
heterogeneous data
en_US
dc.subject
high-dimensional data
en_US
dc.title
Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology
en_US
dc.type
Doctoral Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2019-03-25
ethz.size
156 p.
en_US
ethz.code.ddc
DDC - DDC::5 - Science::570 - Life sciences
ethz.identifier.diss
25780
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02000 - Dep. Mathematik / Dep. of Mathematics::02537 - Seminar für Statistik (SfS) / Seminar for Statistics (SfS)::03502 - Bühlmann, Peter L. / Bühlmann, Peter L.
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02000 - Dep. Mathematik / Dep. of Mathematics::02537 - Seminar für Statistik (SfS) / Seminar for Statistics (SfS)::03502 - Bühlmann, Peter L. / Bühlmann, Peter L.
en_US
ethz.date.deposited
2019-03-25T09:42:22Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2019-03-25T15:26:28Z
ethz.rosetta.lastUpdated
2024-02-02T07:25:14Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Multivariate%20Methods%20for%20Heterogeneous%20High-Dimensional%20Data%20in%20Genome%20Biology&rft.date=2019&rft.au=Velten,%20Britta&rft.genre=unknown&rft.btitle=Multivariate%20Methods%20for%20Heterogeneous%20High-Dimensional%20Data%20in%20Genome%20Biology