Variant Calling from Next Generation Sequencing Data and Information Management
Ralph, Eliza (2014-11)
Turun yliopisto
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe20251017101941
Abstract
The aim of the thesis is to develop a data management solution for heterogeneous data sets on the Glanville fritillary butterfly (Melitaea cinxia), a key model species in metapopulation ecology that has been subjected to large-scale sequencing. Specific analysis needs arose from several NGS projects in the Metapopulation Research Group (MRG), University of Helsinki, which required a method for storing and easily accessing non-public datasets.
Variant calling is the process of making the distinction between biological variants and sequencing errors from Next Generation Sequencing (NGS) data. Variants discovered from calling include SNPs (single nucleotide polymorphisms) and indels (insertion/deletion mutations).
Methods for generating called variants from NGS experiments and tools for variation analysis were reviewed, and an example variant calling pipeline was presented. The end result of variant calling is a VCF (Variant Call Format) file, a tab-delimited file format for storing variation data.
Technical methods for the extraction, storage and analysis of data were examined. Two database paradigms, NoSQL and relational, were compared. The ETL (Extract, Transform, Load) process for loading the data into a database was developed using SQL Server Integration Services (SSIS). The relational database model was designed in SQL Server 2014. The web-based front-end for querying data was developed in ASP.NET MVC 5.
The solution enables the study of variation between and within populations and landscapes, the selection of interesting SNPs, and support for the design of genotyping experiments. The primary aim is to show locations of variation in genomic scaffolds. Because it uses the standardised VCF format, the application could also be applied to studying variation in other organisms.
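As an illustration of the tab-delimited VCF layout mentioned above, the following is a minimal Python sketch, not the thesis's SSIS-based ETL process. It reads the first five fixed VCF columns (CHROM, POS, ID, REF, ALT) from one data line and labels the record as a SNP or an indel. The names `VariantRecord` and `parse_vcf_line`, the scaffold names, and the classification rule (single-base REF and ALT alleles count as a SNP, anything else as an indel) are assumptions made for this example only; real VCF files also contain symbolic and missing alleles that this sketch does not handle.

```python
from typing import List, NamedTuple


class VariantRecord(NamedTuple):
    """Hypothetical container for the fields used in this sketch."""
    chrom: str       # scaffold or chromosome name (VCF column CHROM)
    pos: int         # 1-based position on the scaffold (VCF column POS)
    ref: str         # reference allele (VCF column REF)
    alts: List[str]  # alternate alleles (VCF column ALT, comma-separated)
    kind: str        # "SNP" or "indel", using the simplified rule below


def parse_vcf_line(line: str) -> VariantRecord:
    """Parse one tab-delimited VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    alts = alt.split(",")
    # Simplifying assumption: every allele one base long means a SNP;
    # any length difference is treated as an insertion/deletion.
    is_snp = len(ref) == 1 and all(len(a) == 1 for a in alts)
    return VariantRecord(chrom, int(pos), ref, alts, "SNP" if is_snp else "indel")
```

For example, a made-up line such as `scaffold_1<TAB>1042<TAB>.<TAB>A<TAB>G<TAB>50<TAB>PASS<TAB>.` would be labelled a SNP, while a line with REF `CT` and ALT `C` would be labelled an indel. Records of this shape could then be loaded as rows into a relational table keyed by scaffold and position.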