1Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreports
finnish parkinson’s disease study 
integrating protein-protein 
interaction network data with 
exome sequencing analysis
Ari Siitonen  1,2*, Laura Kytövuori  1,2, Mike A. nalls3,4, Raphael Gibbs3, Dena G. Hernandez3, 
pauli Ylikotila5,6, Markku peltonen7, Andrew B. Singleton3 & Kari Majamaa1,2
Variants associated with Parkinson’s disease (PD) have generally a small effect size and, therefore, 
large sample sizes or targeted analyses are required to detect significant associations in a whole exome 
sequencing (WES) study. Here, we used protein-protein interaction (PPI) information on 36 genes with 
established or suggested associations with PD to target the analysis of the WES data. We performed an 
association analysis on WES data from 439 Finnish PD subjects and 855 controls, and included a Finnish 
population cohort as the replication dataset with 60 PD subjects and 8214 controls. Single variant 
association (SVA) test in the discovery dataset yielded 11 candidate variants in seven genes, but the 
associations were not significant in the replication cohort after correction for multiple testing. Polygenic risk 
score using variants rs2230288 and rs2291312, however, was associated to PD with odds ratio of 2.7 (95% 
confidence interval 1.4–5.2; p < 2.56e-03). Furthermore, an analysis of the PPI network revealed enriched 
clusters of biological processes among established and candidate genes, and these functional networks 
were visualized in the study. We identified novel candidate variants for PD using a gene prioritization based 
on PPI information, and described why these variants may be involved in the pathogenesis of PD.
The genetic etiology of Parkinson’s disease (PD) is complex (see e.g.1). Many variants are associated with PD, 
but the effect of each variant seems to be small2. Hence, large sample sizes will be required in a successful search 
for new variants3, and the probability of success can be further increased by employing information on genes 
possibly associated with the disease. Reduction of variants in the analysis can be accomplished by combining 
protein-protein interaction (PPI) data with genomic data (see e.g.4).
We have previously conducted a whole exome sequencing (WES) study on Finnish PD patients and popula-
tion controls5. Here, we carried out a single variant re-analysis of these subjects and included an additional rep-
lication cohort of Finnish ancestry in the analysis. PPI information enabled us to focus the analysis on 36 genes 
with established or suggested associations with PD (PD36) and their interaction partners. Single variant analysis 
(SVA) in the discovery dataset yielded 11 candidate variants in seven genes, which were then analyzed in the rep-
lication dataset. Polygenic risk score (PRS) was calculated with two of the candidate variants and association to 
PD was tested in the replication dataset. In order to visualize possible biological processes related to these genes, 
we created PPI networks that included functional information of PD36 genes and functional information of novel 
candidate genes and genes of PD associated loci6,7.
Subjects and Methods
Study populations and WES data preparation. Details of the three Finnish studies, Mitopark, 
Stampeed and FINRISK, have been described previously5,8. Exome sequencing of the FINRISK-study has been 
performed at McDonnell Genome Institute, Washington University, and variant calling at Broad Institute as 
described previously9 and summarized in Supplementary Material.
1Research Unit of Clinical Neuroscience, University of Oulu, Oulu, Finland. 2Department of Neurology and Medical 
Research Center Oulu, Oulu University Hospital and University of Oulu, Oulu, Finland. 3Laboratory for Neurogenetics, 
National Institute on Aging, National Institutes of Health, Bethesda, MD, USA. 4Data Tecnica International, Glen 
Echo, MD, 20812, USA. 5Institute of Clinical Medicine, Department of Neurology, University of Turku, Turku, Finland. 
6Division of Clinical Neurosciences, Turku University Hospital, Turku, Finland. 7THL, Helsinki, Finland. *email: 
Ari.Siitonen@iki.fi
open
2Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreportswww.nature.com/scientificreports/
Quality control of the exome sequences in Mitopark, Stampeed and FINRISK datasets was carried out with 
methodology described previously5. The discovery dataset (men 46%) consisted of exome sequences from 
Mitopark (N = 392 cases), Stampeed (N = 493 controls) and FINRISK (N = 47 cases; N = 362 controls). Total 
genotyping rate was 0.98. A portion of FINRISK cases (N = 107) were randomly assigned to the discovery and 
replication datasets, and in the discovery dataset the FINRISK controls were matched to cases with respect to sex 
and age. Only variants found in all the three studies were included in the merged discovery dataset. The replica-
tion dataset (men 48%) consisted of 60 FINRISK cases and 8214 FINRISK controls.
The study was approved by the Ethics Committee of the Turku University Hospital. All the methods were carried 
out in accordance with the relevant guidelines and regulations and informed consent was obtained from all participants.
Gene prioritization using protein-protein interaction data. UniProt database (version 2018-02) 
was queried with the phrase parkinson disease:disease AND organism:“Homo sapiens (Human) [9606]” in order 
to find proteins related to PD. In total, 36 proteins (Table 1) with established or suggested associations with 
Parkinson’s disease (PD36 proteins) were found. Information on protein-protein interactions was downloaded 
from Integrated interactions database (IID)10. Experimentally detected human specific interactions (version 
2017-04) contained 18,627 vertices and 280,845 interactions. We then created a network (Supplementary Fig. S1) 
that included PD36 proteins and proteins that have direct interactions with these proteins. In total, the network 
(PD2300net) consisted of 2305 UniProt protein identifiers.
Whole exome sequencing data analysis. Analysis workflow is shown in Fig. 1. Whole exome sequences 
(WES) from the three studies were subjected to single-variant association (SVA) test and polygenic risk score 
(PRS) association test. Analysis was focused to 36 genes with established or suggested associations with PD 
(PD36) and 2269 genes that interacted with them in PD2300net. The discovery dataset was filtered to include 
only variants in genes in PD2300net and variants 20 kbp upstream or downstream of PD2300net genes. In total, 
there were 8091 variants in the discovery set.
SVA test and PRS association test were performed using logistic regression with the first ten principal com-
ponents as covariates. Genomic inflation factor lambda (based on median chi-squared test) was 1.0 in the dis-
covery dataset. Variants with empirically set p value threshold of 0.0005 (N = 11) were used in logistic regression 
analysis in replication dataset, using the same settings as in the discovery dataset except that age was added into 
the covariates. Bonferroni correction for multiple testing was applied to replication results. Sanger sequencing of 
samples in the discovery dataset confirmed that variants rs113574896 and chr10_105048270_AGAG_A were false 
positive findings. Variants rs2627037, rs922984, rs2291310, rs2291311, rs2291312 in TTN gene were in linkage 
disequilibrium (LD) with each other and had almost identical frequency profiles (Table 2).
Polygenic risk score was calculated using variants rs2230288 and rs2291312. These variants were selected on the 
basis of p-value cutoff 0.0005 in the SVA test and it was required that minor allele frequency was similar in the discovery 
and replication datasets. False positive variants rs113574896 and chr10_105048270_AGAG_A were removed from 
the analysis. Furthermore, all the combinations of the five variants in LD in TTN gene yielded similar PRS results and 
therefore variant rs2291312 was included to represent the LD block. Selected variants rs2230288 and rs2291312 were 
weighted by logarithm of odds ratio, and these values were added together for each sample in order to obtain PRS.
For disease prediction, the discovery dataset was used in training and the replication dataset in testing the 
model. Separate models were built on the two variants as features (variant model) and the polygenic risk score as 
the feature (PRS model). Variant model used random forest and PRS model logistic regression as classifier. The 
# Uniprot ID Gene Symbol # Uniprot ID Gene Symbol
1 Q9NQ11 ATP13A2 19 P49821 NDUFV1
2 Q9Y6H1 CHCHD2 20 Q99497 PARK7
3 O75165 DNAJC13 21 O95263 PDE8B
4 O75061 DNAJC6 22 Q9BXM7 PINK1
5 Q04637 EIF4G1 23 O60733 PLA2G6
6 Q9Y3I1 FBXO7 24 P54098 POLG
7 P04062 GBA 25 Q9UGJ0 PRKAG2
8 Q6Y7W6 GIGYF2 26 O60260 PRKN
9 O43464 HTRA2 27 P37840 SNCA
10 Q5S007 LRRK2 28 Q9Y6H5 SNCAIP
11 P10636 MAPT 29 Q13501 SQSTM1
12 P03886 MT-ND1 30 O43426 SYNJ1
13 P03897 MT-ND3 31 Q9BSA9 TMEM175
14 P03915 MT-ND5 32 Q96A57 TMEM230
15 Q8N183 NDUFAF2 33 P09936 UCHL1
16 Q5TEU4 NDUFAF5 34 P55072 VCP
17 O43181 NDUFS4 35 Q709C8 VPS13C
18 O75251 NDUFS7 36 Q96QK1 VPS35
Table 1. Genes with suggested associations with Parkinson’s disease that were used to build PD2300net.
3Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreportswww.nature.com/scientificreports/
models were evaluated with the aid of sensitivity, specificity, area under curve (AUC), balanced accuracy score 
and mean decrease in impurity as the main metrics11,12 (MDI). Balanced accuracy score is equal to the arithmetic 
mean of sensitivity and specificity.
Depth of coverage was analysed in a random sample of 400 cases or controls from the discovery dataset.
Visualization of protein-protein interaction network. Visualization network was built using exper-
imentally detected human-specific interactions of the IID database. Interactors in this network were PD36 
proteins, corresponding proteins of the candidate genes, and corresponding proteins of the genes from two 
genome-wide association study (GWAS) meta-analysis studies and from our previous WES and GWAS study5–7 
(Fig. 2). In order to select genes related to loci in GWAS studies, genes were retrieved within 250kbp up- and 
downstream of the reported chromosome positions (Supplementary Fig. S2). Genes that belonged to PD2300net 
were selected and the original chromosome positions of these genes (GWAS hits) are shown in Supplementary 
Table S1-S3. Furthermore, genes that belonged to PD2300net and were either significant in our previous WES 
study in gene-level analysis or contained de-novo variants with high effect size (Supplementary Table S4).
The visualization network (PD network 1) included edges between PD36 proteins, GWAS or WES hits and 
candidate proteins. This network of 74 proteins incorporated 29 PD36 proteins, 38 GWAS or WES hits and seven 
novel candidate protein hits with 165 edges (Fig. 3, Supplementary Table S5). The largest connected component 
of PD network 1 consisted of 64 genes.
Randomization test was performed on PD network 1. The main metrics were average path length (APL) and 
average clustering coefficient (ACC). Random graphs (N = 10,000) with size and degree distribution similar to 
PD network 1 were generated from IID database data by label permutation, and one-sided Monte Carlo p value 
was calculated for the true APL and ACC values. The degree distributions of the random graphs were preserved 
by binning proteins into 30 equal sized bins by their network degree distribution and swapping the protein labels 
between the proteins in the same degree bin without replacement.
Programs and databases. Plink 1.9b13 and R version 3.4.3 were used to prepare and analyze the exome 
sequences. Python version 3.5 with Scikit-learn library version 0.19.2 was used for logistic regression and ran-
dom forest prediction models and MDI analysis14. Linkage disequilibrium between TTN variants was tested with 
LDlink webtool setting Europeans as the population15. Depth of coverage analysis was conducted using Genome 
Analysis Toolkit version 3.116.
All networks were built and visualized using Python 3.5, Pyensembl version 1.1.0, Spark version 2.1.0, 
Graphframes version 0.5.0, Python library NetworkX version 2.1 and Gephi version 0.9.2. Random graphs for PPI 
network randomization test were built using in-house scripts. Average path length and average clustering coeffi-
cient was calculated using Stanford Network Analysis Platform (SNAP) version 5.017. Pathway and gene enrich-
ment information was acquired from GSEA v6.1 (software.broadinstitute.org/gsea) and STRING v1018 databases.
Results
Single-variant association test. We identified 8091 variants in PD2300net genes among 439 cases and 
855 controls in the discovery dataset. Several p value thresholds from 0.05 to 0.00005 were tested and the lowest 
threshold 0.0005 that identified more than ten variants was selected for further evaluation in order to keep the 
false positive rate low. Eleven variants with p value less than 0.0005 in six novel genes and one established risk 
Figure 1. Whole exome sequencing data analysis workflow.
4Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreportswww.nature.com/scientificreports/
gene (candidate genes) were selected for replication, but none of the associations was significant in the replica-
tion dataset after correction for multiple testing (Table 2). One of the 11 variants was rs2230288 in the GBA gene 
leading to p.E365K (legacy name p.E326K). The frequency of this allele was 8.5% in PD cases and 4.7% in con-
trols in the discovery dataset and similar frequencies were observed in the replication dataset (Table 2) giving an 
odds ratio of 2.1. The association of the variant with PD was not significant after correction for multiple testing, 
however.
We estimated the depth of coverage in discovery dataset. At the whole exome level 90% of all the contigs in our 
study were covered at 20x depth. However, only 19% of the PD36 genes were covered at 20×, with average total 
coverage being about 14x for these genes.
Polygenic risk score based on two variants is associated with the risk of PD. The variants 
rs2230288 and rs2291312 that passed the selection criteria were included in the calculation of polygenic risk 
score (PRS). The mean of PRS was 0.17 (range; 0.00 to 1.78) and 18% of the cases and 9% of the controls belonged 
to the upper quartile of the PRS values. The association of PRS with PD was tested in the replication dataset giving 
a mean odds ratio of 2.7 (95% confidence interval 1.4–5.2; p < 2.56e-03) (Table 3).
Discovery Set Replication Set
GENE SNP CHR BP A1 OR P C_A C_U F_A F_U OR P C_A C_U F_A F_U
UBXN11 rs117509001 1 26629342 A 5.979 0.0004805 13 6 0.0148 0.0035 NA NA 0 80 0 0.00487
GBA rs2230288 1 155206167 T 2.208 8.927e-06 74 81 0.0855 0.0474 2.137 0.02379 10 676 0.083 0.04115
TTN rs2627037 2 179606538 A 1.616 0.0004346 121 166 0.1378 0.0971 1.61 0.05265 20 1780 0.167 0.1084
TTN rs922984 2 179615887 T 1.637 0.0003337 119 163 0.1355 0.0953 1.64 0.04411 20 1749 0.167 0.1065
TTN rs2291310 2 179623758 C 1.637 0.0003337 119 163 0.1355 0.0953 1.642 0.04356 20 1747 0.167 0.1063
TTN rs2291311 2 179629461 C 1.637 0.0003337 119 163 0.1355 0.0953 1.641 0.04386 20 1748 0.167 0.1064
TTN rs2291312 2 179631214 C 1.637 0.0003337 119 163 0.1355 0.0953 1.64 0.0441 20 1749 0.167 0.1065
IKBKB rs140485496 8 42178280 T 2.666 0.0001978 31 34 0.0353 0.0199 NA NA 0 376 0 0.02289
MIR7705/PABPC1 rs113574896 8 101717195 C 3.987 1.122e-11 84 46 0.0966 0.0269 NA NA 0 10 0 0.000609
INA chr10_105048270_AGAG_A 10 105048270 A 5.722 5.064e-05 14 11 0.0294 0.0064 2.782 0.01913 6 302 0.05 0.01838
KARS/TERF2IP rs1865493 16 75681743 G 0.548 5.161e-05 84 242 0.0957 0.1417 1.169 0.54 18 2130 0.15 0.1297
Table 2. Single variants in the discovery and replication datasets. Discovery set: cases N = 439; controls 
N = 855; replication set: cases N = 60; controls N = 8214; Bonferroni cutoff p < 0.0045; OR = odds ratio; 
C_A = Allele 1 count among cases; C_U = Allele 1 count among controls; F_A = Allele 1 frequency among 
cases; F_U = Allele 1 frequency among controls.
Figure 2. Workflow of creating visualization of protein-protein interaction network. PPI, protein-protein 
interaction; WES, whole exome sequencing; GWAS, genome-wide association study; GSEA, gene set 
enrichment analysis.
5Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreportswww.nature.com/scientificreports/
Two prediction models (variant model, PRS model) were then trained with the discovery dataset and tested in 
the replication dataset. Predictive power of the two models was generally low (Tables 4 and 5). Balanced accuracy 
score was 0.56 and the area under curve (AUC) score was 57%. The models classified 27% of the controls and 38% 
of the cases as cases.
Analysis of the protein-protein interaction network. The protein-protein interaction network (PD 
network 1) included 29 of the PD36 proteins, 38 GWAS hits and seven novel candidate proteins with 165 edges 
(Fig. 3, Supplementary Table S5). The largest connected component of PD network 1 consisted of 64 genes.
Protein-protein interactions in PD network 1 revealed that the proteins encoded by the seven novel candidate 
genes interacted directly with seven PD36 genes and five GWAS or WES hits (Fig. 3). Furthermore, RALGDS was 
identified in PD network 1 as a possible source of the GWAS hit in locus chr9:135955826 (rs11243993), identified 
in our previous study5. Variant rs11243993 was identified in 14 cases and none of the controls in GWAS analysis, 
but WES analysis did not identify any significant RALGDS variants.
PRKN
NELFCD
LTK
HIST1H3J
PABPC1
UCHL1
SQSTM1
VPS35
MT-ND1
GBF1
PINK1
NEK1
NDUFAF1
COQ7
NDUFAF5
EIF4G1
TERF2IP
CHCHD2
VKORC1
SREBF1
GAK
PARK7
PRKAG2
KARS
IKBKB
ITPA
CXCL11
MAP4K4POLG
RPL19
COA3
PITX3
PSMB8
ATP13A2
RALGDS
SF3A2
GIGYF2
SNCAIP
VCP
TMEM230
FYN
NDUFV1
VPS25
C17orf53
STAT3
HTRA2
FUS
UBXN11
CCAR2
MT-ND5
PLA2G6
SCAMP3
LRRK2
TTN
INA
PSMB9
SYNJ1
TUBG1
LMNB2
SNCA
GBA
MAPT
TPM1
TREX1
STK39
TBC1D5
NDUFS7
BAG3
COL7A1
FBXO7
NDUFS4
DPM3
SH3GL2
SOCS4
Figure 3. PD network 1. Protein-protein interaction network visualizing the interactions between established 
and suggested PD genes and candidate genes. Interactions (edges) of the seven novel candidate genes are 
highlighted in red color. Abbreviations: PD36, 36 established or suggested PD genes; CANDIDATE, seven novel 
candidate genes; GWAS NALLS, GWAS hits in Nalls et al.7 meta-analysis discovery phase; GWAS CHANG, 
GWAS hits in Chang et al.6 meta-analysis discovery phase; GWAS/WES FIN, significant GWAS hits and 
selected WES hits in Siitonen et al.5.
P FDR Bonf OR 2.5% 97.5% Estimate std.error Statistic
PRS 2.56e-03 1.19e-02 3.58e-02 2.7078 1.4175 5.1728 1.00 0.33 3.016
AGE 5.26e-07 3.68e-06 7.37e-06 1.0584 1.0352 1.0821 0.06 0.01 5.016
PC1 4.16e-02 1.46e-01 5.83e-01 0 0 0.3028 −31.55 15.49 −2.037
Table 3. Logistic regression results of polygenic risk score in the replication dataset. PRS = Polygenic risk score; 
AGE = age at onset/age at sampling; PC1 = principal component 1; P = p value; FDR = False discovery rate; 
Bonf = Bonferroni correction; OR = Odds ratio; 2.5% = 95% lower confidence; 97.5% = 95% upper confidence.
6Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreportswww.nature.com/scientificreports/
Randomization test was performed to PD network 1 in order to estimate the statistical significance of the cre-
ated PPI network. The largest connected component had significantly shorter average path length (p < 0.01066) 
and significantly greater average clustering coefficient (p < 0.000414) than what was expected of random network 
with the same size and node degree distribution (Supplementary Fig. S3). This suggests that PD Network 1 could 
be considered as relatively small world in comparison to random graph.
Gene set enrichment analysis (GSEA) revealed that 28 proteins in PD network 1 were involved in phosphate 
metabolism (p < 5.72e-8), 29 proteins in phosphorylation (p < 1.62E-15) and 20 proteins in organonitrogen com-
pound metabolism (p < 5.03E-11) (Supplementary Table S6).
An interaction between the established PD proteins and candidate proteins in PD network 1 was also evident 
on inspection of the information on cell signalling by protein phosphorylation (uniprot.org) (Supplementary 
Fig. S4). Similarly, we observed a functional network, where candidate proteins interact with the rest of the net-
work, if we merged information on previously suggested biological processes in PD (such as ubiquitination, mito-
chondrial function, signaling cascades, transportation and RNA processing) into PD network 1 (Supplementary 
Fig. S5).
Discussion
We analyzed here our previous WES data now focusing on a targeted set of 2305 genes. We compared 439 Finnish 
PD cases and 855 Finnish controls in the discovery phase and replicated the results in another dataset consisting 
of subjects with Finnish ancestry. SVA test was not significant in the replication dataset, but an association was 
found between PD and PRS. In addition, protein-protein interaction network showed that the novel variants 
identified here, loci identified in a recent GWAS meta-analysis6,7, selected significant hits in our previous GWAS 
and WES study5, and known PD genes formed a network with clusters of biological processes, further suggesting 
the involvement of these proteins in PD.
We found a significant association between PRS and PD in the replication dataset with an odds ratio of 2.7. 
Predictive PRS model was able to capture 38% of the cases, although with a high rate of false negative cases. 
The prediction rate was generally low both in the PRS model and in the variant model. The relative effect of 
the p.E365K variant in GBA in the classification was higher than that of TTN in the variant model with mean 
decrease in impurity (MDI) of 58%, leaving MDI of 42% to the TTN variant.
The clinical significance of the GBA variant p.E365K has been controversial (see Clinvar: RCV000487503), 
but a recent meta-analysis provided some evidence that p.E365K may indeed be associated with PD19. The asso-
ciation was not significant in our study after correction for multiple testing, but ten out of 60 patients carried this 
mutation in the replication dataset giving an allele frequency of 0.08, which was similar to that in the discovery 
dataset. Interestingly, the allele frequency of the variant in non-Finnish European populations is 0.01 according 
to gnomAD20, whereas we found a frequency of 0.04 in the Finnish population. Without genome-wide significant 
(p < 5e-8) results, our study cannot completely define the role of rs2230288 in PD.
Five variants (rs2627037, rs922984, rs2291310, rs2291311, rs2291312) were located in the titin (TTN) gene. 
TTN is a large gene, rich in variants and, therefore, it is possible that the association reflects variant ascertainment 
or sequencing bias. The Clinvar database reports 11,148 TTN variants, among which there are at least 251 patho-
genic variants in 17 different conditions including cardiomyopathies, skeletal muscle phenotypes and congenital 
diseases21.
Titin acts in sarcomere assembly and has role in elasticity and resting tension of striated muscles22,23. The 
variants identified in our study were located in or in close proximity of immunoglobulin-like domains 19 and 20 
that account for the elasticity of titin. In addition, oxidation of the domains has been shown to lead to stiffening 
of the protein24. Interestingly, the molecular spring titin determines, at least in part, muscle stiffness and rigidity 
and tremor are the clinical hallmarks in PD25,26. Muscle stiffness likely plays a role in determining the frequency 
of oscillatory motion and therefore the changes in titin structure could impact whole-animal movement by mod-
ulating muscle stiffness.
Only 19% of the PD36 genes in WES data had a depth of coverage of 20×. This may have caused a loss of 
significant findings in established PD genes and should be taken into consideration when estimating the results.
We composed a PPI network that was based on 36 PD genes. The network enabled us to identify genes that 
interact, in addition to the established PD genes, with genes in the vicinity of GWAS hits reported in previous 
Model Accuracy Specificity Sensitivity Bal. accuracy AUC 95%CI
PRS 0.73 0.73 0.38 0.56 0.57 0.499–0.65
Variant 0.73 0.73 0.38 0.56 0.57 0.501–0.63
Table 4. Metrics of prediction models in the replication dataset. PRS = Polygenic risk score model; 
Variant = Variant model; Bal. accuracy = Balanced accuracy score; AUC = area under curve score; 
95%CI = AUC 95% confidence interval.
Predicted as cases Predicted as controls
Actual Cases 23 (TP) 37 (FN)
Actual Controls 2199 (FP) 6015 (TN)
Table 5. Confusion matrix of models in the replication dataset. TP = True positive; FN = False negative; 
FP = False positive; TN = True negative.
7Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreportswww.nature.com/scientificreports/
meta-analyses, genes in the vicinity of significant GWAS hits in our previous study and selected WES hits from 
our previous study. Not all plausible PD genes were used to build the initial gene set and although this approach 
may reduce the number of candidate genes identified, the small initial set of PD genes should have reduced the 
number of false positive findings. PPI network visualizations and GSEA described the network context around 
the identified candidate genes and supported their relation to established PD genes. Furthermore, a review of 
the literature on the seven candidate genes indicated that they may be involved in neurodegenerative diseases 
(Supplementary Table S7, Supplementary Material). Interestingly, 64 proteins in PD network 1 were connected 
via interaction suggesting a linking factor between them or a common signaling cascade. Phosphate metabolism 
and phosphorylation were among the most common processes identified in the GSEA analysis, but also other 
previously known biological processes, such as mitochondrial processing, ubiquitination and response to stress 
were identified.
GWAS hit at locus rs11243993 from our previous study was suggested to originate from RALGDS gene in 
PD network 1. RALGDS has a role in GTPase regulation and in PD network 1 the protein interacts directly with 
SNCA27. Interestingly, GTPase signaling have been suggested to be the link between genomics and etiology of 
PD28.
PPI networks were built using experimental subset of IID, which integrates data from primary data sources, 
such as BioGRID or IntAct. These datasets use various experimental techniques as the source and each have its 
own strengths and pitfalls. Here we did not filter the PPI data for interaction confidence or characterization score, 
but instead expected the genetic association test to serve as evidence for plausible interactions.
We identified novel candidate variants in PD using a combination of WES data and PPI network data. Targeted 
gene analysis, polygenic risk score association analysis and PPI network analysis indicated that these variants may 
be involved in the pathogenesis of PD. The power in our study was limited, and therefore, our findings can be seen 
as hypothesis generating and they require further investigation.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding 
author on reasonable request.
Received: 26 July 2019; Accepted: 29 November 2019;
Published: xx xx xxxx
References
 1. Ferreira, M. & Massano, J. An updated review of Parkinson’s disease genetics and clinicopathological correlations. Acta Neurol. 
Scand. 135, 273–284 (2017).
 2. Nalls, Mike A. et al. Identification of novel risk loci, causal insights, and heritable risk for Parkinson's disease: a meta-analysis of 
genome-wide association studies. The Lancet Neurology 18(12), 1091–1102 (2019).
 3. Sham, P. C. & Purcell, S. M. Statistical power and significance testing in large-scale genetic studies. Nat. Rev. Genet. 15, 335–346 
(2014).
 4. Lage, K. Protein-protein interactions and genetic diseases: The interactome. Biochim. Biophys. Acta 1842, 1971–1980 (2014).
 5. Siitonen, A. et al. Genetics of early-onset Parkinson’s disease in Finland: exome sequencing and genome-wide association study. 
Neurobiol. Aging 53, 195.e–195.e10 (2017).
 6. Chang, D. et al. A meta-analysis of genome-wide association studies identifies 17 new Parkinson’s disease risk loci. Nat. Genet. 49, 
1511–1516 (2017).
 7. Nalls, M. A. et al. Large-scale meta-analysis of genome-wide association data identifies six new risk loci for Parkinson’s disease. Nat. 
Genet. 46, 989–993 (2014).
 8. Zekavat, S. M. et al. Deep coverage whole genome sequences and plasma lipoprotein(a) in individuals of European and African 
ancestries. Nat. Commun. 9, 260–w (2018).
 9. Ganna, A. et al. Quantifying the Impact of Rare and Ultra-rare Coding Variation across the Phenotypic Spectrum. Am. J. Hum. 
Genet. 102, 1204–1211 (2018).
 10. Kotlyar, M., Pastrello, C., Sheahan, N. & Jurisica, I. Integrated interactions database: tissue-specific view of the human and model 
organism interactomes. Nucleic Acids Res. 44, 536 (2016).
 11. Breiman, L., Friedman, J. H. & Olshen, R. A. Stone, cj (1984) classification and regression trees. Wadsworth, Belmont, California 
(2009).
 12. Louppe, G. Understanding random forests: From theory to practice. arXiv preprint arXiv 1407, 7502 (2014).
 13. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 8. eCollection 2015 
(2015).
 14. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
 15. Machiela, M. J. & Chanock, S. J. LDlink: a web-based application for exploring population-specific haplotype structure and linking 
correlated alleles of possible functional variants. Bioinformatics 31, 3555–3557 (2015).
 16. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 
43, 491–498 (2011).
 17. Leskovec, J. & Sosič, R. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent 
Systems and Technology (TIST) 8, 1 (2016).
 18. Szklarczyk, D. et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, 447 
(2015).
 19. Huang, Y., Deng, L., Zhong, Y. & Yi, M. The Association between E326K of GBA and the Risk of Parkinson’s Disease. Parkinsons Dis. 
2018, 1048084 (2018).
 20. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
 21. Henrie, A. et al. ClinVar Miner: Demonstrating utility of a Web-based tool for viewing and filtering ClinVar data. Hum. Mutat. 39, 
1051–1060 (2018).
 22. Trinick, J. Titin as a scaffold and spring. Cytoskeleton. Curr. Biol. 6, 258–260 (1996).
 23. Maruyama, K. Connectin, an elastic protein of striated muscle. Biophys. Chem. 50, 73–85 (1994).
 24. Giganti, D., Yan, K., Badilla, C. L., Fernandez, J. M. & Alegre-Cebollada, J. Disulfide isomerization reactions in titin immunoglobulin 
domains enable a mode of protein elasticity. Nat. Commun. 9, 18–7 (2018).
8Scientific RepoRtS |         (2019) 9:18865  | https://doi.org/10.1038/s41598-019-55479-y
www.nature.com/scientificreportswww.nature.com/scientificreports/
 25. Monroy, J. A., Powers, K. L., Pace, C. M., Uyeno, T. & Nishikawa, K. C. Effects of activation on the elastic properties of intact soleus 
muscles with a deletion in titin. J. Exp. Biol. 220, 828–836 (2017).
 26. Taylor-Burt, K. R., Monroy, J., Pace, C., Lindstedt, S. & Nishikawa, K. C. Shiver me titin! Elucidating titin’s role in shivering 
thermogenesis. J. Exp. Biol. 218, 694–702 (2015).
 27. Spaargaren, M. & Bischoff, J. R. Identification of the guanine nucleotide dissociation stimulator for Ral as a putative effector 
molecule of R-ras, H-ras, K-ras, and Rap. Proceedings of the National Academy of Sciences 91, 12609–12613 (1994).
 28. Obergasteiger, J., Frapporti, G., Pramstaller, P. P., Hicks, A. A. & Volta, M. A new hypothesis for Parkinson’s disease pathogenesis: 
GTPase-p38 MAPK signaling and autophagy as convergence points of etiology and genomics. Molecular neurodegeneration 13, 40 
(2018).
Acknowledgements
We acknowledge our use of the gene set enrichment analysis, GSEA software, and Molecular Signature Database 
(MSigDB) (Subramanian, Tamayo, et al. (2005), PNAS 102, 15545–15550, http://www.broad.mit.edu/gsea/). 
The study was supported in part by a grant from the Sigrid Juselius Foundation. For funding details and 
acknowledgments, please see the Supplementary Material.
Author contributions
Data acquisition A.S., D.G.H., P.Y., M.P., A.B.S., K.M. Data preprocessing. A.S., M.N., R.G. Study design and 
study level analysis. A.S., L.K., K.M. Critical review and writing the manuscript. A.S., L.K., M.N., R.B., D.G.H., 
P.Y., M.P., A.B.S., K.M.
competing interests
Dr Nalls’ participation is supported by a consulting contract between Data Tecnica International LLC and the 
National Institute on Aging, NIH, Bethesda, MD, USA. Dr Nalls also consults for Genoom Health, Illumina 
Inc, The Michael J. Fox Foundation for Parkinson’s Research and University of California Healthcare. The other 
authors declare no competing interests.
Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41598-019-55479-y.
Correspondence and requests for materials should be addressed to A.S.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and 
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International 
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or 
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Cre-
ative Commons license, and indicate if changes were made. The images or other third party material in this 
article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the article’s Creative Commons license and your intended use is not per-
mitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the 
copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
 
© The Author(s) 2019