Burrows‐Wheeler post‐transformation with effective clustering and interpolative coding

dc.contributor.authorArto Niemi
dc.contributor.authorJukka Teuhola
dc.contributor.organizationfi=tietojenkäsittelytiede|en=Computer Science|
dc.contributor.organizationfi=tietotekniikan laitos|en=Department of Computing|
dc.contributor.organization-code1.2.246.10.2458963.20.23479734818
dc.contributor.organization-code1.2.246.10.2458963.20.85312822902
dc.converis.publication-id48563979
dc.converis.urlhttps://research.utu.fi/converis/portal/Publication/48563979
dc.date.accessioned2022-10-27T11:58:43Z
dc.date.available2022-10-27T11:58:43Z
dc.description.abstract<p>Lossless compression methods based on the Burrows‐Wheeler transform (BWT) are regarded as an excellent compromise between speed and compression efficiency: they provide compression rates close to the PPM algorithms, with the speed of dictionary‐based methods. Instead of the laborious statistics‐gathering process used in PPM, the BWT reversibly sorts the input symbols, using as the sort key as many following characters as necessary to make the sort unique. Characters occurring in similar contexts are sorted close together, resulting in a clustered symbol sequence. Run‐length encoding and Move‐to‐Front (MTF) recoding, combined with a statistical Huffman or arithmetic coder, is then typically used to exploit the clustering. A drawback of the MTF recoding is that knowledge of the character that produced the MTF number is lost. In this paper, we present a new, competitive Burrows‐Wheeler posttransform stage that takes advantage of interpolative coding—a fast binary encoding method for integer sequences, being able to exploit clusters without requiring explicit statistics. We introduce a fast and simple way to retain knowledge of the run characters during the MTF recoding and use this to improve the clustering of MTF numbers and run‐lengths by applying reversible, stable sorting, with the run characters as sort keys, achieving significant improvement in the compression rate, as shown here by experiments on common text corpora.</p>
dc.format.pagerange1858
dc.format.pagerange1874
dc.identifier.eissn1097-024X
dc.identifier.jour-issn0038-0644
dc.identifier.olddbid173257
dc.identifier.oldhandle10024/156351
dc.identifier.urihttps://www.utupub.fi/handle/11111/31270
dc.identifier.urlhttps://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2873
dc.identifier.urnURN:NBN:fi-fe2021042612504
dc.language.isoen
dc.okm.affiliatedauthorTeuhola, Jukka
dc.okm.affiliatedauthorNiemi, Arto
dc.okm.discipline113 Computer and information sciencesen_GB
dc.okm.discipline113 Tietojenkäsittely ja informaatiotieteetfi_FI
dc.okm.internationalcopublicationnot an international co-publication
dc.okm.internationalityInternational publication
dc.okm.typeA1 ScientificArticle
dc.publisherJohn Wiley & Sons Ltd
dc.publisher.countryUnited Kingdomen_GB
dc.publisher.countryBritanniafi_FI
dc.publisher.country-codeGB
dc.relation.doi10.1002/spe.2873
dc.relation.ispartofjournalSoftware: Practice and Experience
dc.relation.issue9
dc.relation.volume50
dc.source.identifierhttps://www.utupub.fi/handle/10024/156351
dc.titleBurrows‐Wheeler post‐transformation with effective clustering and interpolative coding
dc.year.issued2020

Tiedostot

Näytetään 1 - 1 / 1
Ladataan...
Name:
spe.2873.pdf
Size:
683.97 KB
Format:
Adobe Portable Document Format
Description:
Publisher's PDF