Book chapter, peer-reviewed

Trigram-Based Vietnamese Text Compression

2016; Springer Nature; Language: English

10.1007/978-3-319-31277-4_26

ISSN

1860-9503

Authors

Vu Nguyen, Hien Nguyen, Hieu N. Duong, Václav Snášel

Topic(s)

Advanced Data Compression Techniques

Abstract

This paper presents a new and efficient method for text compression using a tri-gram dictionary. Many text compression methods have been proposed, such as run-length coding, Huffman coding, and Lempel-Ziv-Welch (LZW) coding; most of them are based on the frequency of occurrence of letters in the text. Our method first splits the text into tri-grams and then encodes each tri-gram, based on the tri-gram dictionary, using 4 bytes per tri-gram. We evaluate the method on Vietnamese text. To build the tri-gram dictionary, we collected a text corpus of about 2.15 GB from the Internet; the resulting dictionary contains more than 74,400,000 tri-grams. The test set consists of 10 text files of different sizes. Experimental results show that our method achieves a compression ratio of around 82%. Compared with WinZIP version 19.5 ( http://www.winzip.com/win/en/index.htm ) (which combines LZ77 (Ziv and Lempel in IEEE Trans Inf Theory 24(5), 530–536, 1978 [20]) and Huffman coding) and WinRAR version 5.21 ( http://www.rarlab.com/download.htm ) (which combines LZSS (Storer and Szymanski in J ACM 29(4), 928–951, 1982 [17]) and Prediction by Partial Matching [2]), our method achieves a higher compression ratio for every file size in our test cases.
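The encoding step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how text is tokenized, how the dictionary is stored, or what happens to tri-grams missing from the dictionary. The sketch assumes whitespace-separated tokens, a dictionary built from a sliding window over a toy corpus, a big-endian unsigned 32-bit index per tri-gram, and input whose token count is a multiple of three. The names `build_dictionary`, `encode`, and `decode` are hypothetical.

```python
import struct


def build_dictionary(corpus_tokens):
    """Map each distinct tri-gram (3 consecutive tokens) to an integer index.

    Assumption: the dictionary is built with a sliding window over the corpus.
    """
    index = {}
    for i in range(len(corpus_tokens) - 2):
        tri = tuple(corpus_tokens[i:i + 3])
        if tri not in index:
            index[tri] = len(index)
    return index


def encode(text, index):
    """Replace each non-overlapping tri-gram with its 4-byte dictionary index."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        # Assumption: leftover tokens are out of scope for this sketch.
        raise ValueError("token count must be a multiple of 3")
    out = bytearray()
    for i in range(0, len(tokens), 3):
        tri = tuple(tokens[i:i + 3])
        # Raises KeyError for a tri-gram that is not in the dictionary.
        out += struct.pack(">I", index[tri])
    return bytes(out)


def decode(data, index):
    """Invert encode(): map each 4-byte index back to its three tokens."""
    reverse = {code: tri for tri, code in index.items()}
    tokens = []
    for (code,) in struct.iter_unpack(">I", data):
        tokens.extend(reverse[code])
    return " ".join(tokens)


# Toy data, invented for illustration only.
SAMPLE = "hôm nay trời đẹp quá đi"
INDEX = build_dictionary(SAMPLE.split())
ENCODED = encode(SAMPLE, INDEX)
```

Here the six-token sample becomes two tri-grams, so `ENCODED` is 8 bytes long and `decode(ENCODED, INDEX)` returns the original string. The 4-byte code width is consistent with the reported dictionary size: a 3-byte index can address only 2^24 = 16,777,216 entries, fewer than the 74,400,000 tri-grams reported, whereas 4 bytes can address 2^32 entries.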

Reference(s)