Article, Open access

Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text

2015; Volume: 7; Issue: 2; Language: English

10.5121/ijcsit.2015.7204

ISSN

0975-4660

Authors

William J. Teahan, Khaled M. Alhawiti

Topic(s)

Natural Language Processing Techniques

Abstract

In this paper, several new universal preprocessing techniques are described to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially adjust the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. Firstly, a simple bigraph (two-byte) substitution technique is described that leads to significant improvement in compression for many languages when they are encoded by the Unicode scheme (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Secondly, a new preprocessing technique that outputs separate vocabulary and symbols streams, which are subsequently encoded separately, is also investigated. This also leads to significant improvement in compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless text compression of Arabic text are described for dotted and non-dotted forms of the language.
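To make the idea of bigraph substitution concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the most frequent byte pairs in the UTF-8 input are each replaced by one new symbol numbered from 256 upward (expanding the alphabet), and that replacement is greedy and left to right. The function names, the `max_entries` parameter, and the symbol numbering are hypothetical choices made for illustration; the resulting symbol sequence would then be passed to a PPM coder, which is not shown.

```python
from collections import Counter


def build_bigraph_table(data: bytes, max_entries: int = 100) -> dict:
    """Map the most frequent byte pairs to new symbols >= 256 (assumed scheme)."""
    counts = Counter(zip(data, data[1:]))  # counts pairs at every byte offset
    most_common = counts.most_common(max_entries)
    return {pair: 256 + i for i, (pair, _) in enumerate(most_common)}


def substitute_bigraphs(data: bytes, table: dict) -> list:
    """Greedy left-to-right pass: emit one symbol per matched pair, else the byte."""
    out = []
    i = 0
    while i < len(data):
        pair = (data[i], data[i + 1]) if i + 1 < len(data) else None
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out


def restore_bigraphs(symbols: list, table: dict) -> bytes:
    """Invert the substitution so the original UTF-8 bytes are recovered."""
    inverse = {symbol: pair for pair, symbol in table.items()}
    out = bytearray()
    for symbol in symbols:
        if symbol in inverse:
            out.extend(inverse[symbol])
        else:
            out.append(symbol)
    return bytes(out)


if __name__ == "__main__":
    text = "مرحبا بالعالم".encode("utf-8")  # Arabic letters take two bytes each
    table = build_bigraph_table(text)
    symbols = substitute_bigraphs(text, table)
    assert restore_bigraphs(symbols, table) == text
    print(len(text), "bytes ->", len(symbols), "symbols")
```

In scripts such as Arabic, Armenian, Persian and Russian, most characters occupy two bytes in UTF-8, so a substitution of this kind shortens the symbol sequence that the compressor has to model. The substitution table must also be available to the decoder, a cost this sketch does not account for.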

Reference(s)