TCBLex - A lexical database of Finnish literary texts for children

dc.contributor.author: Nojonen, Tapio
dc.contributor.author: Korsu, Kiia
dc.contributor.author: Ginter, Filip
dc.contributor.author: Laippala, Veronika
dc.contributor.author: Kanerva, Jenna
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization: fi=fysiologia ja genetiikka|en=Physiology and Genetics|
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 1.2.246.10.2458963.20.70712835001
dc.converis.publication-id: 504652992
dc.converis.url: https://research.utu.fi/converis/portal/Publication/504652992
dc.date.accessioned: 2026-01-21T12:20:36Z
dc.date.available: 2026-01-21T12:20:36Z
dc.description.abstract: This work introduces TCBLex, a lexical database of Finnish literary works read by children between the ages of 7 and 15. We explain in detail the work done to build the corpus TCBLex is based on, including how books were sampled and collected, turned into text files, and finally processed. We also touch on legal considerations and how it is possible to build such a corpus in the EU. TCBLex contains over 11 million tokens that are annotated with parts-of-speech tags and lemmatized. We provide 14 different sub-lexicons in total, covering individual intended reading ages, age groups, as well as different genres. We also provide versions with additional morphological information, such as the cases and tenses of words. TCBLex provides various psycholinguistically interesting lexical statistics for both word types and lemmas, such as different frequency metrics, distributions, word lengths, numbers of syllables, morphological paradigm sizes, and for the first time in a Finnish lexicon, ages when words and lemmas are first encountered in books. TCBLex is freely available at https://doi.org/10.5281/zenodo.15655580.
dc.identifier.eissn: 1554-3528
dc.identifier.jour-issn: 1554-351X
dc.identifier.olddbid: 212364
dc.identifier.oldhandle: 10024/195382
dc.identifier.uri: https://www.utupub.fi/handle/11111/51607
dc.identifier.url: https://doi.org/10.3758/s13428-025-02832-x
dc.identifier.urn: URN:NBN:fi-fe202601215792
dc.language.iso: en
dc.okm.affiliatedauthor: Nojonen, Tapio
dc.okm.affiliatedauthor: Korsu, Kiia
dc.okm.affiliatedauthor: Ginter, Filip
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.affiliatedauthor: Kanerva, Jenna
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 515 Psychology [en_GB]
dc.okm.discipline: 6121 Languages [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 515 Psykologia [fi_FI]
dc.okm.discipline: 6121 Kielitieteet [fi_FI]
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Springer Science and Business Media LLC
dc.publisher.country: United States [en_GB]
dc.publisher.country: Yhdysvallat (USA) [fi_FI]
dc.publisher.country-code: US
dc.relation.articlenumber: 312
dc.relation.doi: 10.3758/s13428-025-02832-x
dc.relation.ispartofjournal: Behavior Research Methods
dc.relation.volume: 57
dc.source.identifier: https://www.utupub.fi/handle/10024/195382
dc.title: TCBLex - A lexical database of Finnish literary texts for children
dc.year.issued: 2025

Files

Name: s13428-025-02832-x.pdf
Size: 807.31 KB
Format: Adobe Portable Document Format