Active learning of molecular data for task-specific objectives

dc.contributor.author: Ghosh, Kunal
dc.contributor.author: Todorović, Milica
dc.contributor.author: Vehtari, Aki
dc.contributor.author: Rinke, Patrick
dc.contributor.organization: fi=materiaalitekniikka|en=Materials Engineering|
dc.contributor.organization-code: 1.2.246.10.2458963.20.80931480620
dc.converis.publication-id: 477959192
dc.converis.url: https://research.utu.fi/converis/portal/Publication/477959192
dc.date.accessioned: 2025-08-27T21:35:32Z
dc.date.available: 2025-08-27T21:35:32Z
dc.description.abstract: Active learning (AL) has shown promise to be a particularly data-efficient machine learning approach. Yet, its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as molecular representation. For the first task, we tested different data acquisition strategies, batch sizes, and GP noise settings. AL was insensitive to the acquisition batch size, and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform the randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.
dc.identifier.eissn: 1089-7690
dc.identifier.jour-issn: 0021-9606
dc.identifier.olddbid: 200688
dc.identifier.oldhandle: 10024/183715
dc.identifier.uri: https://www.utupub.fi/handle/11111/46725
dc.identifier.url: https://doi.org/10.1063/5.0229834
dc.identifier.urn: URN:NBN:fi-fe2025082789205
dc.language.iso: en
dc.okm.affiliatedauthor: Todorovic, Milica
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 114 Physical sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.discipline: 114 Fysiikka (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: AIP Publishing
dc.publisher.country: United States (en_GB)
dc.publisher.country: Yhdysvallat (USA) (fi_FI)
dc.publisher.country-code: US
dc.relation.articlenumber: 014103
dc.relation.doi: 10.1063/5.0229834
dc.relation.ispartofjournal: Journal of Chemical Physics
dc.relation.issue: 1
dc.relation.volume: 162
dc.source.identifier: https://www.utupub.fi/handle/10024/183715
dc.title: Active learning of molecular data for task-specific objectives
dc.year.issued: 2025
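
Illustrative note (not part of the deposited record): the abstract describes pool-based active learning with a Gaussian process, where new data points are acquired according to model uncertainty. The sketch below is a minimal, hypothetical illustration of that general idea only. It assumes a fixed squared-exponential kernel with unit prior variance, a one-dimensional toy pool in place of many-body tensor descriptors, and a plain maximum-variance acquisition rule; it omits the clustering step and the batch acquisition studied in the article. All function names, parameters, and data are assumptions made for this example and are not taken from the authors' code.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_predict(x_train, y_train, x_query, noise=1e-3, length_scale=1.0):
    """GP posterior mean and variance with fixed (not fitted) hyperparameters."""
    k_tt = rbf_kernel(x_train, x_train, length_scale)
    k_tt = k_tt + noise * np.eye(len(x_train))  # noise term on the diagonal
    k_qt = rbf_kernel(x_query, x_train, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = 1.0 - (v**2).sum(axis=0)  # prior variance of this kernel is 1
    return mean, np.maximum(var, 0.0)


def active_learning(x_pool, y_pool, n_init=5, n_steps=20, seed=0):
    """Start from a random subset, then repeatedly label the most uncertain point."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(x_pool), size=n_init, replace=False)]
    for _ in range(n_steps):
        _, var = gp_predict(x_pool[labeled], y_pool[labeled], x_pool)
        var[labeled] = -np.inf  # never re-select an already labeled point
        labeled.append(int(np.argmax(var)))
    return labeled


# Toy stand-in for a molecular dataset: 200 one-dimensional "descriptors"
# with a smooth target property (purely synthetic).
x_pool = np.linspace(0.0, 10.0, 200)[:, None]
y_pool = np.sin(x_pool[:, 0])

picked = active_learning(x_pool, y_pool, n_init=5, n_steps=20, seed=0)
mean, var = gp_predict(x_pool[picked], y_pool[picked], x_pool)
```

A random-selection baseline, which the abstract reports as competitive for the dataset-compilation task, could be obtained in this sketch by replacing the `argmax` step with a uniform draw from the unlabeled indices.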

Files

Name: 014103_1_5.0229834.pdf
Size: 6 MB
Format: Adobe Portable Document Format