Quality assessment of the Reuters vol. 2 Multilingual Corpus
| dc.contributor.author | Eriksson, Robin | |
| dc.contributor.organization | Language and Speech Technology | |
| dc.contributor.organization-code | 2606805 | |
| dc.converis.publication-id | 29505864 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/29505864 | |
| dc.date.accessioned | 2022-10-28T13:25:36Z | |
| dc.date.available | 2022-10-28T13:25:36Z | |
| dc.description.abstract | We introduce a framework for quality assurance of corpora and apply it to the Reuters Multilingual Corpus (RCV2). The quality assessment of this standard newsprint corpus reveals a significant duplication problem and, to a lesser extent, a problem with corrupted articles. Of the raw collection of some 487,000 articles, almost one tenth are trivial duplicates. A smaller fraction of articles appear to be corrupted and should be excluded for that reason. The detailed results are being made available as on-line appendices to this article. This effort also demonstrates the beginnings of a constraint-based methodological framework for quality assessment and quality assurance of corpora. As a first implementation of this framework, we have investigated constraints to verify sample integrity and to diagnose sample duplication, entropy aberrations, and tagging inconsistencies. To help identify near-duplicates in the corpus, we have employed both entropy measurements and a simple byte bigram incidence digest. | |
| dc.format.pagerange | 1813-1819 | |
| dc.identifier.isbn | 978-2-9517408-9-1 | |
| dc.identifier.olddbid | 182005 | |
| dc.identifier.oldhandle | 10024/165099 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/39099 | |
| dc.identifier.url | http://www.lrec-conf.org/proceedings/lrec2016/index.html | |
| dc.identifier.urn | URN:NBN:fi-fe2021042718704 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Eriksson, Robin | |
| dc.okm.discipline | 222 Other engineering and technologies | en_GB |
| dc.okm.discipline | 222 Muu tekniikka | fi_FI |
| dc.okm.internationalcopublication | not an international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.relation.conference | International Conference on Language Resources and Evaluation (LREC) | |
| dc.source.identifier | https://www.utupub.fi/handle/10024/165099 | |
| dc.title | Quality assessment of the Reuters vol. 2 Multilingual Corpus | |
| dc.title.book | Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) | |
| dc.year.issued | 2016 |
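
The abstract names two signals for spotting near-duplicates: entropy measurements and a byte bigram incidence digest. The record does not define either one precisely, so the following Python is only a minimal sketch under stated assumptions: entropy is taken as Shannon entropy over single bytes, the digest is taken as the set of distinct adjacent byte pairs (presence only, no counts), and Jaccard similarity is a swapped-in comparison that the abstract does not mention. The function names `byte_entropy`, `bigram_incidence_digest`, and `jaccard` are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (assumed measure)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def bigram_incidence_digest(data: bytes) -> frozenset:
    """Set of distinct adjacent byte pairs in the data (assumed digest definition)."""
    return frozenset(zip(data, data[1:]))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two digests; an illustrative choice, not taken from the paper."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up sample texts, not drawn from the corpus.
    first = b"Oil prices rose on Monday."
    second = b"Oil prices rose on Monday!"
    other = b"Zinc quotas vex Fiji."

    for name, text in (("first", first), ("second", second), ("other", other)):
        print(name, round(byte_entropy(text), 3))

    d1, d2, d3 = map(bigram_incidence_digest, (first, second, other))
    print("first vs second:", round(jaccard(d1, d2), 3))
    print("first vs other:", round(jaccard(d1, d3), 3))
```

Under these assumptions, two articles that differ by a few bytes share most of their bigram set and have nearly equal entropy, while an article with an unusually low or high entropy is a candidate for the "entropy aberrations" the abstract mentions. Any thresholds would have to be chosen empirically.
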
Files
- Name: 214_Paper.pdf
- Size: 170.62 KB
- Format: Adobe Portable Document Format
- Description: Publisher's PDF