Dirichlet process mixture models for single-cell RNA-seq clustering

dc.contributor.author: Adossa Nigatu A.
dc.contributor.author: Rytkönen Kalle T
dc.contributor.author: Elo Laura
dc.contributor.organization: fi=InFLAMES Lippulaiva|en=InFLAMES Flagship|
dc.contributor.organization: fi=Turun biotiedekeskus|en=Turku Bioscience Centre|
dc.contributor.organization: fi=biolääketieteen laitos|en=Institute of Biomedicine|
dc.contributor.organization-code: 1.2.246.10.2458963.20.18586209670
dc.contributor.organization-code: 1.2.246.10.2458963.20.68445910604
dc.contributor.organization-code: 1.2.246.10.2458963.20.77952289591
dc.contributor.organization-code: 2609201
dc.converis.publication-id: 175721725
dc.converis.url: https://research.utu.fi/converis/portal/Publication/175721725
dc.date.accessioned: 2022-10-28T13:17:17Z
dc.date.available: 2022-10-28T13:17:17Z
dc.description.abstract: Clustering of cells based on gene expression is one of the major steps in single-cell RNA-sequencing (scRNA-seq) data analysis. One key challenge in cluster analysis is the unknown number of clusters and, for this issue, there is still no comprehensive solution. To enhance the process of defining meaningful cluster resolution, we compare the Bayesian latent Dirichlet allocation (LDA) method with its non-parametric counterpart, the hierarchical Dirichlet process (HDP), in the context of clustering scRNA-seq data. A potential main advantage of HDP is that it does not require the number of clusters as an input parameter from the user. While LDA has been used in single-cell data analysis, it has not been compared in detail with HDP. Here, we compare the cell clustering performance of LDA and HDP using four scRNA-seq datasets (immune cells, kidney, pancreas and decidua/placenta), with a specific focus on cluster numbers. Using both intrinsic (DB-index) and extrinsic (ARI) cluster quality measures, we show that the performance of LDA and HDP is dataset-dependent. We describe a case where HDP produced a more appropriate clustering than the best performer from a series of LDA clusterings with different numbers of clusters. However, we also observed cases where the best-performing LDA cluster numbers appropriately captured the main biological features while HDP tended to inflate the number of clusters. Overall, our study highlights the importance of carefully assessing the number of clusters when analyzing scRNA-seq data.
dc.identifier.jour-issn: 2046-6390
dc.identifier.olddbid: 181057
dc.identifier.oldhandle: 10024/164151
dc.identifier.uri: https://www.utupub.fi/handle/11111/36933
dc.identifier.url: https://doi.org/10.1242/bio.059001
dc.identifier.urn: URN:NBN:fi-fe2022081154533
dc.language.iso: en
dc.okm.affiliatedauthor: Adossa, Nigatu
dc.okm.affiliatedauthor: Rytkönen, Kalle
dc.okm.affiliatedauthor: Elo, Laura
dc.okm.discipline: 1182 Biochemistry, cell and molecular biology [en_GB]
dc.okm.discipline: 1182 Biokemia, solu- ja molekyylibiologia [fi_FI]
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: The Company of Biologists Ltd.
dc.publisher.country: United Kingdom [en_GB]
dc.publisher.country: Britannia [fi_FI]
dc.publisher.country-code: GB
dc.relation.articlenumber: bio059001
dc.relation.doi: 10.1242/bio.059001
dc.relation.ispartofjournal: Biology Open
dc.relation.issue: 4
dc.relation.volume: 11
dc.source.identifier: https://www.utupub.fi/handle/10024/164151
dc.title: Dirichlet process mixture models for single-cell RNA-seq clustering
dc.year.issued: 2022

Files

Name: bio059001.pdf
Size: 1.5 MB
Format: Adobe Portable Document Format