Active learning of molecular data for task-specific objectives

Ghosh, Kunal; Todorović, Milica; Vehtari, Aki; Rinke, Patrick

dc.contributor.author	Ghosh, Kunal
dc.contributor.author	Todorović, Milica
dc.contributor.author	Vehtari, Aki
dc.contributor.author	Rinke, Patrick
dc.contributor.organization	fi=materiaalitekniikka|en=Materials Engineering|
dc.contributor.organization-code	1.2.246.10.2458963.20.80931480620
dc.converis.publication-id	477959192
dc.converis.url	https://research.utu.fi/converis/portal/Publication/477959192
dc.date.accessioned	2025-08-27T21:35:32Z
dc.date.available	2025-08-27T21:35:32Z
dc.description.abstract	Active learning (AL) has shown promise to be a particularly data-efficient machine learning approach. Yet, its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as molecular representation. For the first task, we tested different data acquisition strategies, batch sizes, and GP noise settings. AL was insensitive to the acquisition batch size, and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform the randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.
dc.identifier.eissn	1089-7690
dc.identifier.jour-issn	0021-9606
dc.identifier.olddbid	200688
dc.identifier.oldhandle	10024/183715
dc.identifier.uri	https://www.utupub.fi/handle/11111/46725
dc.identifier.url	https://doi.org/10.1063/5.0229834
dc.identifier.urn	URN:NBN:fi-fe2025082789205
dc.language.iso	en
dc.okm.affiliatedauthor	Todorović, Milica
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	AIP Publishing
dc.publisher.country	United States	en_GB
dc.publisher.country	Yhdysvallat (USA)	fi_FI
dc.publisher.country-code	US
dc.relation.articlenumber	014103
dc.relation.doi	10.1063/5.0229834
dc.relation.ispartofjournal	Journal of Chemical Physics
dc.relation.issue	1
dc.relation.volume	162
dc.source.identifier	https://www.utupub.fi/handle/10024/183715
dc.title	Active learning of molecular data for task-specific objectives
dc.year.issued	2025

Files

Name: 014103_1_5.0229834.pdf
Size: 6 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications