
Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese
2015; Springer Science+Business Media; Volume: 21; Issue: 1; Language: English
DOI: 10.1186/s13173-014-0020-x
ISSN: 1678-4804
Authors: Erick Fonseca, João Luís Garcia Rosa, Sandra Maria Aluísio
Topic(s): Text Readability and Simplification
Abstract: Part-of-speech tagging is an important preprocessing step in many natural language processing applications. Despite much work already carried out in this field, there is still room for improvement, especially in Portuguese. We experiment here with an architecture based on neural networks and word embeddings that has achieved promising results in English. We tested our classifier on different corpora: a new revision of the Mac-Morpho corpus, in which we merged some tags and made corrections, and two previous versions of it. We evaluate the impact of using different types of word embeddings and explicit features as input. We compare our tagger's performance with that of other systems and achieve state-of-the-art results on the new corpus. We show how different methods for generating word embeddings and additional features differ in accuracy. The work reported here contributes a new revision of the Mac-Morpho corpus and a new state-of-the-art tagger available for use out of the box.