Quality assessment of the Reuters vol. 2 Multilingual Corpus

dc.contributor.author: Eriksson Robin
dc.contributor.organization: Language and Speech Technology (fi: kieli- ja puheteknologia)
dc.contributor.organization-code: 2606805
dc.converis.publication-id: 29505864
dc.converis.url: https://research.utu.fi/converis/portal/Publication/29505864
dc.date.accessioned: 2022-10-28T13:25:36Z
dc.date.available: 2022-10-28T13:25:36Z
dc.description.abstract: We introduce a framework for quality assurance of corpora, and apply it to the Reuters Multilingual Corpus (RCV2). The results of this quality assessment of this standard newsprint corpus reveal a significant duplication problem and, to a lesser extent, a problem with corrupted articles. From the raw collection of some 487,000 articles, almost one tenth are trivial duplicates. A smaller fraction of articles appear to be corrupted and should be excluded for that reason. The detailed results are being made available as on-line appendices to this article. This effort also demonstrates the beginnings of a constraint-based methodological framework for quality assessment and quality assurance for corpora. As a first implementation of this framework, we have investigated constraints to verify sample integrity, and to diagnose sample duplication, entropy aberrations, and tagging inconsistencies. To help identify near-duplicates in the corpus, we have employed both entropy measurements and a simple byte bigram incidence digest.
dc.format.pagerange: 1813
dc.format.pagerange: 1819
dc.identifier.isbn: 978-2-9517408-9-1
dc.identifier.olddbid: 182005
dc.identifier.oldhandle: 10024/165099
dc.identifier.uri: https://www.utupub.fi/handle/11111/39099
dc.identifier.url: http://www.lrec-conf.org/proceedings/lrec2016/index.html
dc.identifier.urn: URN:NBN:fi-fe2021042718704
dc.language.iso: en
dc.okm.affiliatedauthor: Eriksson, Robin
dc.okm.discipline: 222 Other engineering and technologies (en_GB)
dc.okm.discipline: 222 Muu tekniikka (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.relation.conference: International Conference on Language Resources and Evaluation (LREC)
dc.source.identifier: https://www.utupub.fi/handle/10024/165099
dc.title: Quality assessment of the Reuters vol. 2 Multilingual Corpus
dc.title.book: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
dc.year.issued: 2016

Files

Name: 214_Paper.pdf
Size: 170.62 KB
Format: Adobe Portable Document Format
Description: Publisher's PDF