Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments)
2019; Springer Nature; Language: English
10.1007/978-3-030-21005-2_10
ISSN: 2190-3026
Authors: Damir Mukhamedshin, Dzhavdet Suleymanov, Olga Nevzorova
Topic(s): Cultural, Linguistic, Economic Studies
Abstract: Corpus management systems are widely used to solve problems of human-computer interaction. Many tools exist for managing language corpora, for example Sketch Engine [1], Manatee [2], and EXMARaLDA [3]. We developed a system that, on the one hand, accounts for certain specific features of Turkic languages and, on the other hand, offers new search functions and components. The corpus management system "Tugan Tel" ( http://tugantel.tatar ) is designed specifically for the National Corpus of Tatar, but it can be used with linguistic corpora of other Turkic languages as well as corpora of other languages. The system supports search for lexical units, morphological and lexical search, search for syntactic units, n-gram search, named entity extraction, and other functions. The semantic model of Tatar language data representation is the core of the system. Corpus data storage, processing, and search are performed using open-source tools (the MariaDB DBMS and the Redis data store). Developing a corpus management search engine involves three basic stages: data model development, system architecture development, and database architecture development. The collection and processing of corpus data must also be considered. The main task of our research is to identify and describe solutions for corpus data storage, collection, and processing. The developed data model can be used for supervised and unsupervised document classification, as well as for corpus exploration. The proposed solutions have been implemented in the corpus management system that is currently used for data representation and processing in the National Corpus of Tatar "Tugan Tel".