Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology

Velten, Britta

doi:10.3929/ethz-b-000333437

dc.contributor.author

Velten, Britta

dc.contributor.supervisor

Bühlmann, Peter

dc.contributor.supervisor

Huber, Wolfgang

dc.contributor.supervisor

Stegle, Oliver

dc.date.accessioned

2019-03-25T15:25:52Z

dc.date.available

2019-03-25T09:42:06Z

dc.date.available

2019-03-25T09:59:07Z

dc.date.available

2019-03-25T15:25:52Z

dc.date.issued

2019

dc.identifier.uri

http://hdl.handle.net/20.500.11850/333437

dc.identifier.doi

10.3929/ethz-b-000333437

dc.description.abstract

Technological advances have transformed the scientific landscape by enabling comprehensive quantitative measurements, thereby increasingly facilitating data-driven research. This includes genome biology, where many data sets nowadays comprise a collection of heterogeneous high-dimensional data modalities, collected from different assays, tissues, organisms, time points or conditions. An important example is multi-omics data, i.e. data combining measurements from multiple biological layers. Jointly, such data promise to provide a better and more comprehensive understanding of biological processes and complex traits. A critical step towards realizing these promises is the development of statistical and computational methods that facilitate moving from the data to sound conclusions and biological insights. For this purpose, an integrative analysis that combines information from different data modalities is essential. In this thesis, we propose novel methods that provide a multivariate approach to data integration, and we apply them in the context of multi-omics studies in precision medicine and single cell biology. Given a collection of different data modalities on a set of samples, we aim to address two main questions. First, how can we obtain an (unbiased) overview of the main structures that are present in the data, both within and across data modalities? And second, how can we use all data to predict a response of interest and identify relevant features, whilst taking the heterogeneity of the features into account?
The first question is important in all exploratory data analysis and leads us to unsupervised methods for data integration. Finding hidden structures in the data can give important insights into biological and technical sources of variation and yield an informative low-dimensional data representation. To this end, we introduce multi-table methods and latent factor models that can capture main axes of variation and co-variation in the data. Based on this, we present a novel factor method, multi-omics factor analysis (MOFA), to integrate information from different data modalities. Through sparsity assumptions on the factor loadings, MOFA decomposes variation into axes present in all, some, or single modalities and promotes interpretable factors with a direct link to molecular drivers. MOFA combines a statistical model that accommodates different data types and missing data with a scalable inference algorithm, thereby ensuring broad applicability. Once learnt, the factors enable a range of downstream analyses, including identification of sample subgroups, outlier detection and data imputation. We demonstrate MOFA's flexibility and potential to generate biological insight by applying it to a multi-omics study on chronic lymphocytic leukaemia as well as a multi-omics single cell data set.
The second question leads us to supervised methods that enable building predictive models and selecting features relevant for a response of interest. Reliable methods for this purpose would have far-reaching consequences in many applications. For example, it would be extremely useful for decisions in clinical care if treatment outcome or disease progression could be predicted from available molecular or clinical data. Furthermore, the identification of important molecular markers could give insights into underlying biological mechanisms and eventually open up new treatment options. For this purpose, we turn to penalized regression and develop a method that uses additional information on the features to adapt the relative strength of penalization in a data-driven manner. Such additional information in the form of external covariates is available in many applications and can, for example, encode structural knowledge about the data, such as different assay types, or provide information on a feature's variance, frequency or signal-to-noise ratio. We show that incorporating informative covariates can improve prediction performance in penalized regression, and we investigate the use of important covariates in genome biology such as the omics or tissue type.
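As a rough illustration of the idea of covariate-dependent penalization described in the abstract, the following minimal sketch fits a ridge regression in which each feature group (for example, an assay type) has its own penalty strength. It is not the method developed in the thesis: the penalty strengths are fixed by hand here rather than learnt from the data, a ridge penalty is assumed only because it has a closed-form solution, and the function name `group_penalized_ridge`, the group labels and the simulated data are all hypothetical.

```python
import numpy as np


def group_penalized_ridge(X, y, groups, group_penalties):
    """Ridge regression with one penalty strength per feature group.

    Minimises ||y - X b||^2 + sum_j lambda_{g(j)} * b_j^2 in closed form,
    where g(j) is the group label (external covariate) of feature j.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Look up each feature's penalty from its group label.
    lam = np.array([group_penalties[g] for g in groups], dtype=float)
    # Normal equations with a diagonal, feature-specific penalty matrix.
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)


# Simulated example (assumption): only the first group carries signal.
rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))
groups = np.array(["rna"] * 10 + ["methylation"] * 10)  # hypothetical labels
beta_true = np.concatenate([rng.normal(size=10), np.zeros(10)])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Hand-picked penalties: weak for the informative group, strong for the other.
beta_hat = group_penalized_ridge(
    X, y, groups, {"rna": 0.1, "methylation": 100.0}
)
```

With equal penalties for all groups the function reduces to ordinary ridge regression, so any gain from the group structure comes only from choosing the penalties differently per group; choosing them in a data-driven manner is the part this sketch leaves out.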

en_US

dc.format

application/pdf

en_US

dc.language.iso

en

en_US

dc.publisher

ETH Zurich

en_US

dc.rights.uri

http://rightsstatements.org/page/InC-NC/1.0/

dc.subject

data integration

en_US

dc.subject

genome biology

en_US

dc.subject

multivariate methods

en_US

dc.subject

penalised regression

en_US

dc.subject

structured regularisation

en_US

dc.subject

latent variable model

en_US

dc.subject

factor analysis

en_US

dc.subject

variational Bayes

en_US

dc.subject

dimensionality reduction

en_US

dc.subject

multi-omics

en_US

dc.subject

heterogeneous data

en_US

dc.subject

high-dimensional data

en_US

dc.title

Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology

en_US

dc.type

Doctoral Thesis

dc.rights.license

In Copyright - Non-Commercial Use Permitted

dc.date.published

2019-03-25

ethz.size

156 p.

en_US

ethz.code.ddc

DDC - DDC::5 - Science::570 - Life sciences

ethz.identifier.diss

25780

en_US

ethz.publication.place

Zurich

en_US

ethz.publication.status

published

en_US

ethz.leitzahl

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02000 - Dep. Mathematik / Dep. of Mathematics::02537 - Seminar für Statistik (SfS) / Seminar for Statistics (SfS)::03502 - Bühlmann, Peter L. / Bühlmann, Peter L.

en_US

ethz.leitzahl.certified

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02000 - Dep. Mathematik / Dep. of Mathematics::02537 - Seminar für Statistik (SfS) / Seminar for Statistics (SfS)::03502 - Bühlmann, Peter L. / Bühlmann, Peter L.

en_US

ethz.date.deposited

2019-03-25T09:42:22Z

ethz.source

FORM

ethz.eth

yes

en_US

ethz.availability

Open access

en_US

ethz.rosetta.installDate

2019-03-25T15:26:28Z

ethz.rosetta.lastUpdated

2024-02-02T07:25:14Z

ethz.rosetta.versionExported

true

ethz.COinS

ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Multivariate%20Methods%20for%20Heterogeneous%20High-Dimensional%20Data%20in%20Genome%20Biology&rft.date=2019&rft.au=Velten,%20Britta&rft.genre=unknown&rft.btitle=Multivariate%20Methods%20for%20Heterogeneous%20High-Dimensional%20Data%20in%20Genome%20Biology

Files in this item

Name: thesis_bvelten.pdf
Size: 20.08 MB
Format: Adobe PDF
Label: Full text

Publication type

Doctoral Thesis [30094]
