Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases

dc.contributorMatemaattis-luonnontieteellinen tiedekunta / Faculty of Mathematics and Natural Sciences, Department of Information Technology
dc.contributor.authorOkser, Sebastian
dc.contributor.departmentfi=Tulevaisuuden teknologioiden laitos|en=Department of Future Technologies|
dc.contributor.facultyfi=Matemaattis-luonnontieteellinen tiedekunta|en=Faculty of Mathematics and Natural Sciences|
dc.date.accessioned2015-07-29T10:09:19Z
dc.date.available2015-07-29T10:09:19Z
dc.date.issued2015-08-19
dc.description.abstractPersonalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have afforded researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining such data with methods other than univariate statistics is a challenging task requiring advanced algorithms that are scalable to the genome-wide level. In the future, next-generation sequencing (NGS) studies will contain an even larger number of common and rare variants. Machine learning-based feature selection algorithms have been shown to effectively create predictive models for various genotype-phenotype relationships. This work explores the problem of selecting genetic variant subsets that are the most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated not only to be effective at predicting the disease phenotypes, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some of it was further extended so that it could be implemented on parallel computers, helping to ensure that the methods will also scale to NGS data sets. Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm.
It was shown that there is no universally optimal algorithm for variant selection in GWAS; rather, methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, the models can yield overly optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype-phenotype relationships and biological insights from genetic data sets.
dc.description.accessibilityfeatureno information on accessibility
dc.description.notificationTransferred from Doria
dc.format.contentfulltext
dc.identifierISBN 978-952-12-3245-9
dc.identifier.olddbid127441
dc.identifier.oldhandle10024/113043
dc.identifier.urihttps://www.utupub.fi/handle/11111/28881
dc.identifier.urnURN:ISBN:978-952-12-3245-9
dc.language.isoeng
dc.publisherTurku Centre for Computer Science
dc.relation.ispartofseriesTUCS Dissertations
dc.relation.issn1239-1883
dc.relation.numberinseries201
dc.source.identifierhttps://www.utupub.fi/handle/10024/113043
dc.titleScalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases
dc.type.ontasotfi=Artikkeliväitöskirja|en=Doctoral dissertation (article-based)|

Files

Name:
TUCSD201Okser_digi.pdf
Size:
4.45 MB
Format:
Adobe Portable Document Format