Article Open access National production Peer-reviewed

Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese

2015; Springer Science+Business Media; Volume: 21; Issue: 1; Language: English

10.1186/s13173-014-0020-x

ISSN

1678-4804

Authors

Erick Fonseca, João Luís Garcia Rosa, Sandra Maria Aluísio

Topic(s)

Text Readability and Simplification

Abstract

Part-of-speech tagging is an important preprocessing step in many natural language processing applications. Despite much work already carried out in this field, there is still room for improvement, especially in Portuguese. We experiment here with an architecture based on neural networks and word embeddings that has achieved promising results in English. We tested our classifier on different corpora: a new revision of the Mac-Morpho corpus, in which we merged some tags and performed corrections, and two previous versions of it. We evaluate the impact of using different types of word embeddings and explicit features as input. We compare our tagger's performance with that of other systems and achieve state-of-the-art results on the new corpus. We show how different methods for generating word embeddings and additional features differ in accuracy. The work reported here contributes a new revision of the Mac-Morpho corpus and a new state-of-the-art tagger available for use out of the box.

Reference(s)