Annotated textual dataset PV600 of perovskite bandgaps for information extraction from literature

dc.contributor.author: Sipilä, Matilda
dc.contributor.author: Mehryary, Farrokh
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Ginter, Filip
dc.contributor.author: Todorovic, Milica
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=materiaalitekniikka|en=Materials Engineering|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 1.2.246.10.2458963.20.80931480620
dc.converis.publication-id: 499752519
dc.converis.url: https://research.utu.fi/converis/portal/Publication/499752519
dc.date.accessioned: 2026-01-21T14:51:58Z
dc.date.available: 2026-01-21T14:51:58Z
dc.description.abstract: Scientific literature provides a variety of experimental and theoretical data which, if extracted, could offer new opportunities for data-driven discovery in materials research. Natural language processing (NLP) tools enable information extraction (IE) of structured information from unstructured text. The performance of IE tools needs to be systematically evaluated on manually annotated test datasets, but there are few publicly available annotated materials science datasets and none on perovskites, which are promising materials for photovoltaics. We present a perovskite literature dataset with 600 text segments extracted from an open access manuscript corpus. The PV600 dataset focuses on five inorganic and hybrid perovskites and contains 227 manually annotated bandgap values identified from 188 segments. We also recorded the bandgap type: experimental, computational, from the literature, or from an unknown source. To demonstrate the intended use of the dataset, we applied it to evaluate the IE performance of a question answering (QA) method, a rule-based method, and generative large language models (LLMs). We also demonstrate a further application: testing LLM-based segment preselection for IE.
dc.identifier.eissn: 2052-4463
dc.identifier.jour-issn: 2052-4463
dc.identifier.olddbid: 213805
dc.identifier.oldhandle: 10024/196823
dc.identifier.uri: https://www.utupub.fi/handle/11111/55950
dc.identifier.url: https://www.nature.com/articles/s41597-025-05637-x
dc.identifier.urn: URN:NBN:fi-fe202601217035
dc.language.iso: en
dc.okm.affiliatedauthor: Sipilä, Matilda
dc.okm.affiliatedauthor: Mehryary, Farrokh
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.affiliatedauthor: Ginter, Filip
dc.okm.affiliatedauthor: Todorovic, Milica
dc.okm.discipline: 112 Statistics and probability (en_GB)
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 216 Materials engineering (en_GB)
dc.okm.discipline: 112 Tilastotiede (fi_FI)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.discipline: 216 Materiaalitekniikka (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 DataArticle
dc.publisher: NATURE PORTFOLIO
dc.publisher.country: United Kingdom (en_GB)
dc.publisher.country: Britannia (fi_FI)
dc.publisher.country-code: GB
dc.publisher.place: BERLIN
dc.relation.articlenumber: 1401
dc.relation.doi: 10.1038/s41597-025-05637-x
dc.relation.ispartofjournal: Scientific Data
dc.relation.volume: 12
dc.source.identifier: https://www.utupub.fi/handle/10024/196823
dc.title: Annotated textual dataset PV600 of perovskite bandgaps for information extraction from literature
dc.year.issued: 2025

Files

Name: s41597-025-05637-x.pdf
Size: 3.12 MB
Format: Adobe Portable Document Format