Open access
Author
Date
2021
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
The assembly of the draft Bos taurus reference genome was a milestone for genetics- and genomics-oriented research in cattle. The reference genome of domestic cattle was built from a single animal of the Hereford breed. However, this linear reference sequence does not represent the genetic diversity of global cattle breeds. The lack of diversity causes problems, particularly when DNA sequences from genetically distant animals are aligned and compared to the reference sequence, an issue widely known as reference bias. Pangenomes are a novel reference structure that can capture the full spectrum of genetic diversity within a species. A rich, graph-based pangenome reference can integrate multiple genome assemblies and their sites of variation in a coherent and non-redundant data structure. This thesis investigated, for the first time, the utility of graph-based references for genomic analysis in a livestock population.
Chapter 2 assessed the feasibility of graph-based genomic analysis in cattle. Specifically, a graph-based sequence variant genotyping approach was implemented using the Graphtyper software and compared, using whole-genome sequencing data of 49 Original Braunvieh cattle, to two widely used methods (SAMtools and GATK) that rely on a strictly linear representation of the reference. A comparison between sequence variant and array-derived genotypes indicated that the graph-based approach outperformed both SAMtools and GATK with regard to genotype concordance, non-reference sensitivity, non-reference discrepancy, and Mendelian consistency of genotypes observed in parent-offspring pairs. These findings demonstrated that graph-based genotyping using Graphtyper is accurate, sensitive, and computationally feasible in the cattle genome.
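To make the comparison metrics concrete, the following is a minimal sketch of how genotype concordance, non-reference sensitivity, and non-reference discrepancy are commonly computed when array-derived genotypes are treated as the truth set. It is not code from the thesis: the function name `concordance_metrics`, the 0/1/2 genotype coding (count of alternate alleles), and the assumption that both inputs cover the same sites with no missing calls are illustrative choices, and the thesis may define the metrics slightly differently.

```python
import math
from typing import Dict, Sequence


def concordance_metrics(array_gt: Sequence[int], seq_gt: Sequence[int]) -> Dict[str, float]:
    """Compare sequence-derived genotypes against array-derived genotypes.

    Genotypes are coded as alternate-allele counts (0, 1, 2); both inputs
    must list the same sites in the same order (an assumption of this sketch).
    """
    if len(array_gt) != len(seq_gt):
        raise ValueError("genotype lists must have the same length")

    pairs = list(zip(array_gt, seq_gt))

    # Genotype concordance: share of sites with identical genotypes.
    matches = sum(a == s for a, s in pairs)

    # Non-reference sensitivity: share of non-reference array genotypes
    # that are also called non-reference from sequence data.
    truth_nonref = [(a, s) for a, s in pairs if a > 0]
    detected = sum(s > 0 for _, s in truth_nonref)

    # Non-reference discrepancy: share of discordant genotypes after
    # excluding sites where both sources agree on homozygous reference.
    informative = [(a, s) for a, s in pairs if not (a == 0 and s == 0)]
    discordant = sum(a != s for a, s in informative)

    def ratio(num: int, den: int) -> float:
        return num / den if den else math.nan

    return {
        "concordance": ratio(matches, len(pairs)),
        "non_ref_sensitivity": ratio(detected, len(truth_nonref)),
        "non_ref_discrepancy": ratio(discordant, len(informative)),
    }


# Hypothetical genotypes for six sites, for illustration only.
example = concordance_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 0, 2, 1])
```

Excluding concordant homozygous-reference calls from the discrepancy keeps the large number of trivially matching reference genotypes from masking errors at variant sites.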
Chapter 3 reports on the construction of breed-specific and multi-breed genome graphs for four European cattle breeds (Original Braunvieh, Brown Swiss, Fleckvieh, and Holstein). The vg toolkit was used to augment the linear Hereford-based reference sequence with variants that were prioritized based on allele frequency in different breeds. Based on both real and simulated short-read sequencing data, this study showed that variant prioritization is crucial to build informative genome graphs. Intriguingly, adding many low-frequency and rare variants to the genome graphs compromised mapping accuracy. Moreover, this chapter demonstrated that multi-breed graphs and breed-specific graphs enable almost identical mapping improvements over a linear reference genome. Finally, the first whole-genome graph was constructed for the Brown Swiss cattle breed using 14 million variants. The application of this whole-genome graph facilitated accurate short-read mapping and unbiased sequence variant discovery.
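As an illustration of what prioritizing variants by allele frequency can look like before graph construction, the sketch below keeps only variants whose alternate-allele frequency reaches a chosen cutoff. The `Variant` class, the `prioritize_variants` function, and the 0.01 threshold are hypothetical and chosen for the example; the abstract does not state the thresholds or tooling used for this step.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Variant:
    """Minimal variant record (hypothetical, for illustration only)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_freq: float  # alternate-allele frequency in the breed of interest


def prioritize_variants(variants: Iterable[Variant], min_alt_freq: float) -> List[Variant]:
    """Keep variants whose alternate-allele frequency is at least min_alt_freq."""
    if not 0.0 <= min_alt_freq <= 1.0:
        raise ValueError("min_alt_freq must be between 0 and 1")
    return [v for v in variants if v.alt_freq >= min_alt_freq]


# Made-up variants; the cutoff of 0.01 is an assumption, not a value from the thesis.
candidates = [
    Variant("1", 1000, "A", "G", 0.50),
    Variant("1", 2000, "C", "T", 0.005),
    Variant("2", 3000, "G", "A", 0.10),
]
kept = prioritize_variants(candidates, min_alt_freq=0.01)
```

Only the retained variants would then be passed on to graph construction, so that rare variants do not add paths to the graph that few reads can support.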
Chapter 4 reports on integrating six reference-quality bovine genome assemblies into a unified multi-assembly graph using the minigraph software. The pangenome contains 70 megabases that are not present in the current ARS-UCD1.2 Bos taurus reference genome. Using complementary bioinformatics approaches, this chapter provides compelling evidence that these non-reference sequences contain functionally active and biologically relevant elements. Specifically, the analysis of transcriptome data revealed putatively novel genes, including some that are differentially expressed between individual animals. Moreover, variant discovery in the non-reference sequences revealed thousands of previously undetected polymorphic sites capturing genetic differentiation across cattle breeds. This chapter demonstrated that multi-assembly graphs make previously neglected genetic variation amenable to genetic investigations.
Overall, this thesis presents a novel analysis paradigm in livestock genomics by leveraging variation-aware reference structures. The analyses presented in this thesis provide a first step towards the transition from linear to graph-based reference structures in order to mitigate inherent biases of the linear reference genome. Importantly, this thesis establishes a computational framework to integrate multiple genome assemblies and their sites of variation into a more diverse reference structure that is broadly applicable across species.
Permanent link
https://doi.org/10.3929/ethz-b-000510180
Publication status
published
Publisher
ETH Zurich
Subject
Pangenome; Bioinformatics; Genomics; Genetics
Organisational unit
09575 - Pausch, Hubert / Pausch, Hubert