Book chapter, peer-reviewed

Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources

2017; Springer Science+Business Media; Language: English

10.1007/978-3-319-59268-8_8

ISSN

1611-3349

Authors

Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović

Topic(s)

Advanced Computational Techniques and Applications

Abstract

Large collections of textual documents are an example of big data that requires solving three basic problems: the representation of documents, the representation of information needs, and the matching of the two representations. This paper outlines the introduction of document indexing as a possible solution to document representation. Documents within a large textual database, developed over many years for geological projects in the Republic of Serbia, were indexed using methods developed within digital humanities: bag-of-words and named entity recognition. Documents in this geological database are described by a summary report and other data, such as title, domain, keywords, abstract, and geographical location. These metadata were used to generate a bag of words for each document with the aid of morphological dictionaries and transducers. Named entities within the metadata were also recognized with the help of a rule-based system. Both the bag of words and the metadata were then used for pre-indexing each document. A combination of several tf-idf based measures was applied for selecting and ranking the retrieval results of indexed documents for a specific query, and the results were compared with those of the initial retrieval system that was already in place. In general, a significant improvement was achieved according to the standard information retrieval performance measures, with the InQuery method performing best.
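The bag-of-words indexing and tf-idf ranking mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the whitespace tokenizer stands in for the morphological dictionaries and transducers, the weighting is one common textbook tf-idf variant rather than the combination of measures or the InQuery method evaluated in the chapter, and the names `tokenize`, `build_index`, `rank` and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumeric characters.
    # The chapter uses morphological dictionaries instead of this shortcut.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def build_index(docs):
    # docs maps a document id to its metadata text.
    # Returns per-document term counts (bags) and document frequencies (df).
    bags = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())
    return bags, df


def rank(query, bags, df):
    # Score each document as the sum over query terms of
    # (1 + ln tf) * ln(N / df); documents scoring zero are dropped.
    n = len(bags)
    scores = {}
    for doc_id, bag in bags.items():
        score = 0.0
        for term in tokenize(query):
            tf = bag.get(term, 0)
            if tf:
                score += (1 + math.log(tf)) * math.log(n / df[term])
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical sample metadata, for illustration only.
docs = {
    "d1": "coal deposit exploration report",
    "d2": "groundwater survey report",
    "d3": "coal coal mine report",
}
bags, df = build_index(docs)
ranking = rank("coal", bags, df)
```

With this weighting, a term that occurs in every document has an inverse document frequency of zero and contributes nothing to the ranking, which is why a query consisting only of such a term returns no results.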
